1.  Introduction

In this paper we discuss some methods of factor analysis. The entire discussion is centered
around one general probability model. We consider some mathematical problems of the model,
such as whether certain kinds of observed data determine the model uniquely. We treat the
statistical problems of estimation and tests of certain hypotheses. For these purposes the
asymptotic distribution theory of some statistics is treated.

The primary aim of this paper is to give a unified exposition of this part of factor analysis
from the viewpoint of the mathematical statistician. The literature on factor analysis is
scattered; moreover, the many papers and books have been written from many different points
of view. By confining ourselves to one model and by emphasizing statistical inferences for
this model we hope to present a clear picture to the statistician.

The development given here is expected to point up features of model-building and statistical
inference that occur in other areas where statistical theories are being developed. For
example, nearly all of the problems met in factor analysis are met in latent structure analysis.

There  are  also  some  new  results  given  in  this  paper.  The  proofs  of  these  are  mainly 
given  in  a  technical  Part  II  of  the  paper. 

In confining ourselves to the mathematical and statistical aspects of one model, we are
leaving out of consideration many important and interesting topics. We shall not consider how
useful this model may be nor in what substantive areas one may expect to find data (and
problems) that fit the model. We also do not consider methods based on other models. In doing
this, we do not mean to imply that the model considered here is the most useful or important.
It seems that this model has some usefulness and importance, it has been studied considerably,
and one can give a fairly unified exposition of it.

Extensive discussion of the purposes and applications (as well as other developments) of
factor analysis is given in books by psychologists (for example, Holzinger and Harman [10],
Thomson [23], Thurstone [24]). Some general discussion of statistical inference has been given
in papers by Bartlett [9] and Kendall [12].

PART  I.  EXPOSITORY 


2.  The  model 

The  model  we  consider  is 

(2.1)  $X = \Lambda f + U + \mu,$

where $X$, $U$, and $\mu$ are column vectors of $p$ components, $f$ is a column vector of
$m$ ($\le p$) components, and $\Lambda$ is a $p \times m$ matrix. We assume that $U$ is
distributed independently of $f$ and with mean $\mathcal{E}U = 0$ and covariance matrix
$\mathcal{E}UU' = \Sigma$, which is diagonal. The vector $f$ will in some cases be treated as a
random vector, and in other cases will be treated as a vector of parameters which varies from
observation to observation. The vector $X$ constitutes the observable quantities.

The most familiar interpretation of this model is in terms of mental tests. Each component of
$X$ is a score on a test or battery of tests. The corresponding component of $\mu$ is the
average score of this test in the population. The components of $f$ are the mental factors;
linear combinations of these enter into the test scores. The coefficients of these linear
combinations are the elements of $\Lambda$, and these are called factor loadings. Sometimes the
elements of $f$ are called common factors because they are common to several different tests;
in the first presentation of this kind of model (Spearman [20]) $f$ consisted of one component
and was termed the general factor. A component of $U$ is the part of the test score not
“explained” by the common factors. This is considered as made up of the error of measurement
in the test plus a specific factor, this specific factor having to do only with this
particular test. Since in our model (with one set of observations on each individual) we
cannot distinguish between these two components of the coordinate of $U$, we shall simply term
the element of $U$ the error of measurement.

The specification of a given component of $X$ is similar to that in regression theory (or
analysis of variance) in that it is a linear combination of other variables. Here, however,
$f$, which plays the role of the independent variable, is not observed.

We can distinguish between two kinds of models. In one we consider the vector $f$ to be a
random vector, and in the other we consider $f$ to be a vector of nonrandom quantities which
varies from one individual to another. In the second case it would be more accurate to write
$X_\alpha = \Lambda f_\alpha + U + \mu$. In the former case one sample of size $N$ is
equivalent to any other sample of size $N$. In the latter case, however, a set of observations
$x_1, \ldots, x_N$ is not equivalent to $x_{N+1}, \ldots, x_{2N}$ because $f_1, \ldots, f_N$
will not be the same as $f_{N+1}, \ldots, f_{2N}$ and these enter as parameters. Another way of
looking at this distinction is that in the latter case we have the conditional distribution of
$X$ given $f$. The distinction we are making is the one made in analysis of variance models
(components of variance and linear hypothesis models).

When $f$ is taken as random we shall assume $\mathcal{E}f = 0$. (Otherwise,
$\mathcal{E}X = \Lambda\mathcal{E}f + \mu$, and $\mu$ can be redefined to absorb
$\Lambda\mathcal{E}f$.) Let $\mathcal{E}ff' = M$. Our analysis will be made entirely in terms
of first and second moments. Usually, we shall consider $f$ and $U$ to have normal
distributions. If $f$ is not random, then $f = f_\alpha$ for the $\alpha$th individual. Then we
shall assume usually

$\frac{1}{N}\sum_{\alpha=1}^{N} f_\alpha = 0$  and  $\frac{1}{N}\sum_{\alpha=1}^{N} f_\alpha f_\alpha' = M.$

There is a fundamental indeterminacy in this model. Let $f = Af^*$ ($f^* = A^{-1}f$) and
$\Lambda^* = \Lambda A$, where $A$ is a nonsingular ($m \times m$) matrix. Then (2.1) can be
written as

(2.2)  $X = \Lambda^* f^* + U + \mu,$

where here (when $f$ is random)

(2.3)  $\mathcal{E}f^*f^{*\prime} = A^{-1}M(A^{-1})' = M^*,$

say. If $f$ is normal or if we only consider second-order moments, the model with $\Lambda$ and
$f$ is equivalent to the model with $\Lambda^*$ and $f^*$; that is, by observing $X$ we cannot
distinguish between these two models.
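
The equivalence of the two parametrizations can be checked numerically at the level of second moments. The following is only a minimal sketch: the numerical values of $\Lambda$, $M$, and $A$ are illustrative assumptions that do not come from the paper, and the helper names are hypothetical. It verifies that $\Lambda^* M^* \Lambda^{*\prime} = \Lambda M \Lambda'$, so the covariance matrix of $X$ is the same under both descriptions.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inv2(a):
    """Inverse of a nonsingular 2 x 2 matrix."""
    (p, q), (r, s) = a
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

# Illustrative values (assumptions, not taken from the paper): p = 3, m = 2.
Lam = [[F(1), F(0)], [F(2), F(1)], [F(1), F(3)]]   # loadings, p x m
M = [[F(2), F(1)], [F(1), F(2)]]                   # second-moment matrix of f
A = [[F(1), F(2)], [F(0), F(1)]]                   # any nonsingular m x m matrix

A_inv = inv2(A)
Lam_star = matmul(Lam, A)                             # Lambda* = Lambda A
M_star = matmul(matmul(A_inv, M), transpose(A_inv))   # M* = A^{-1} M (A^{-1})'

common = matmul(matmul(Lam, M), transpose(Lam))       # Lambda M Lambda'
common_star = matmul(matmul(Lam_star, M_star), transpose(Lam_star))
same_second_moments = common == common_star
```

Exact rational arithmetic is used so that the comparison is an equality rather than a tolerance check; the loadings differ while the implied second moments agree.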

Some of the indeterminacy in the model can be eliminated by requiring that
$\mathcal{E}ff' = I$ (or $\sum_{\alpha} f_\alpha f_\alpha' = NI$, if $f$ is not random). In
this case the factors are said to be orthogonal; if $M$ is not diagonal, the factors are said
to be oblique. When we assume $M = I$, then (2.3) is $A^{-1}(A^{-1})' = I$ ($I = AA'$). The
indeterminacy is equivalent to multiplication by an orthogonal matrix; this is called the
problem of rotation. Requiring that $M$ be diagonal means that the components of $f$ are
independently distributed when $f$ is assumed normal. This has an appeal to psychologists
because one idea of common mental factors is (by definition) that they are independent or
uncorrelated quantities.

A crucial assumption is that the components of $U$ are uncorrelated. Our viewpoint is that the
errors of observation and the specific factors are by definition uncorrelated. That is, the
interrelationships of the test scores are caused by the common factors, and that is what we
want to investigate. There is another point of view on factor analysis that is fundamentally
quite different; that is, that the common factors are supposed to explain or account for as
much of the variance of the test scores as possible. To follow this point of view, we should
use a different model.

At this point we perhaps should indicate another point of view which we do not treat. That is
that mental factors are positive quantities; any individual has these to some degree; each
test score depends on these in a positive way. This implies that all the coefficients of
$\Lambda$ are nonnegative. This point of view leads to important and interesting
considerations. However, in this paper we shall not consider this.

As in all problems of multivariate statistics, a geometric picture helps the intuition. We
consider a $p$-dimensional space. The columns of $\Lambda$ can be considered as $m$ vectors in
this space. They span some $m$-dimensional subspace; in fact, they can be considered as
coordinate axes in the $m$-dimensional space, and $f$ can be considered as coordinates of a
point in that space referred to this particular axis-system. This subspace is called the
factor space. Multiplying $\Lambda$ on the right by a matrix corresponds to taking a new set of
coordinate axes in the factor space.


3.  The  problems 

We now list the considerations which must be made for this model. We point out that exactly
the same considerations enter into other models, for example, latent structure analysis. For
the sake of outlining these problems we shall assume that $f$ is random and is formally
distributed with $\mathcal{E}ff' = I$.

(I)  Existence of the model. From (2.1) we deduce that $X$ is normally distributed with mean
$\mu$ and covariance matrix

(3.1)  $\mathcal{E}(X - \mu)(X - \mu)' = \mathcal{E}(\Lambda f + U)(\Lambda f + U)'$
       $= \mathcal{E}(\Lambda ff'\Lambda' + Uf'\Lambda' + \Lambda fU' + UU')$
       $= \Lambda\,\mathcal{E}ff'\,\Lambda' + \mathcal{E}UU'$
       $= \Lambda\Lambda' + \Sigma$
       $= \Psi,$

say. Suppose we have some normal population with mean $\mu^*$ and covariance matrix $\Psi^*$;
is there a factor analysis model that can generate this population? Essentially, this is a
question whether the equation $\Psi^* = \Lambda\Lambda' + \Sigma$ can be solved, or rather,
what conditions must $\Psi^*$ satisfy so that $\Psi^* = \Lambda\Lambda' + \Sigma$ can be solved.
Another way of looking at this problem leads to the formulation of what is the minimum $m$ for
which the equation can be solved.

(II)  Identification. Suppose there is some $\Lambda$ and $\Sigma$ such that
$\Psi^* = \Lambda\Lambda' + \Sigma$. Does this equation then have a unique solution? From the
previous discussion it is clear that the above equation is also satisfied by $\Lambda\Theta$,
where $\Theta$ is an orthogonal matrix. We can consider (i) if other restrictions are placed on
$\Lambda$, is the solution unique, or (ii) are $\Lambda$ and $\Sigma$ determined uniquely
except for multiplication of $\Lambda$ on the right by an orthogonal matrix?

(III)  Determination of the structure. Given $\Psi$ [and suppose that (3.1) can be solved
uniquely], how do we determine $\Lambda$ and $\Sigma$?

We  now  turn  to  the  statistical  problems. 

(IV)  Estimation of parameters. A sample of $N$ individuals is drawn, and from these
observations we wish to estimate $\mu$, $\Sigma$, and $\Lambda$. It is assumed that (3.1) can
be solved uniquely and that $m$ is known. One would like to know the properties of various
estimation methods.

(V)  Test of the hypothesis that the model fits. Here we suppose that $m$ is given. We test the
hypothesis that $\mathcal{E}(X - \mu)(X - \mu)'$ can be of the form $\Lambda\Lambda' + \Sigma$.

(VI)  Determination of the number of factors. In many situations the number of factors $m$
cannot be specified in advance of the statistical investigation. In these cases, the
investigator wants to use as few factors as possible to “explain” the population. On what
basis should he decide that he has the right number of factors?

(VII)  Other tests of hypothesis. There are various hypotheses about the parameters,
particularly about $\Lambda$, that are of interest.

(VIII)  Estimation of factor scores. We want to make statements about the $f$'s of our
observed $X$'s.


4.  Problems  of  the  population:  Existence  of  the  structure  (I) 

If $f$ and $U$ are normally distributed, the model postulates that the vector of $p$ test
scores $X$ has a multivariate normal distribution with a vector of means $\mu$ and a
covariance matrix $\Psi$ which has the form

(4.1)  $\Psi = \mathcal{E}(X - \mu)(X - \mu)' = \mathcal{E}(\Lambda f + U)(\Lambda f + U)' = \Lambda M\Lambda' + \Sigma,$

where $\Sigma$ is diagonal and positive definite, $\Lambda$ is a $p \times m$ matrix with $m$
specified, and $M$ is an arbitrary positive definite matrix of order $m$. In this case the
problem of existence of the structure is the problem whether the distribution of a vector $X$
has the above form. The question of normality will not be considered here; the vector of means
$\mu$ is unrestricted and hence is of no question. The essential question is whether the
covariance matrix of $X$ has the form of (4.1); that is, given the $p \times p$ positive
definite matrix $\Psi$, can it be expressed as $\Sigma + \Lambda M\Lambda'$ ($\Sigma$ diagonal
and $\Lambda$ of size $p \times m$)? If $f$ is not normal, we restrict our considerations to
second-order moments, and the essential problem is the same.

As far as our present problem goes, we can assume that $M = I$, for if we are given a matrix
$\Lambda M\Lambda'$ we can write it as $\Lambda^*\Lambda^{*\prime}$ by letting
$\Lambda^* = \Lambda A$, where $A$ is a matrix such that $AA' = M$. Thus we ask if there is a
$\Lambda$ and $\Sigma$ such that

(4.2)  $\Psi = \Sigma + \Lambda\Lambda'.$
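
The reduction to $M = I$ can be illustrated with a concrete choice of $A$. The sketch below assumes the Cholesky factor as one convenient matrix with $AA' = M$ (the paper does not prescribe a particular $A$), and the numerical values and function names are hypothetical.

```python
import math

def cholesky(m):
    """Lower-triangular A with A A' = m, for a symmetric positive definite m."""
    n = len(m)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(a[i][k] * a[j][k] for k in range(j))
            if i == j:
                a[i][j] = math.sqrt(m[i][i] - s)
            else:
                a[i][j] = (m[i][j] - s) / a[j][j]
    return a

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

# Illustrative values (assumptions): oblique factors with second-moment matrix M.
Lam = [[1.0, 0.0], [0.5, 1.0], [2.0, -1.0]]
M = [[4.0, 2.0], [2.0, 5.0]]

A = cholesky(M)                                   # A A' = M
Lam_star = matmul(Lam, A)                         # Lambda* = Lambda A
lhs = matmul(matmul(Lam, M), transpose(Lam))      # Lambda M Lambda'
rhs = matmul(Lam_star, transpose(Lam_star))       # Lambda* Lambda*'
max_diff = max(abs(x - y) for r1, r2 in zip(lhs, rhs) for x, y in zip(r1, r2))
```

Any other square root of $M$ would serve equally well; the different choices differ by an orthogonal factor, which is the rotation indeterminacy already noted.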

One way of determining whether $\Psi$ can be expressed in the desired form is to set about
solving the equations

(4.3)  $\psi_{ii} = \sigma_{ii} + \sum_{\alpha=1}^{m} \lambda_{i\alpha}^2$,  and  $\psi_{ij} = \sum_{\alpha=1}^{m} \lambda_{i\alpha}\lambda_{j\alpha}$,  $i < j$.

These are polynomial equations, and there are well-known methods for solving them. If there is
an algebraic solution, one must ascertain that the $\lambda_{i\alpha}$ are real and the
$\sigma_{ii}$ are real and nonnegative.

The algebraic solution is laborious and gives little insight. What we want are conditions on
$\Psi$ that can be applied more directly.

A good deal of insight can be obtained by comparing the number of equations with the number of
unknowns [25]. In $\Psi$ there are $p(p + 1)/2$ elements, and this is the number of equations
involving the unknowns $\sigma_{ii}$ and $\lambda_{i\alpha}$. There are $p$ elements of the
diagonal $\Sigma$, and there are $pm$ elements of $\Lambda$. However, in any solution $\Lambda$
can be replaced by $\Lambda\Theta$, where $\Theta$ is an orthogonal ($m \times m$) matrix, and
$\Theta$ has $m(m - 1)/2$ independent elements; that is, in any solution, $\Lambda$ can be made
to satisfy $m(m - 1)/2$ additional conditions. Thus the number of equations and conditions
minus the number of unknowns to be determined is

(4.4)  $C = \dfrac{p(p+1)}{2} + \dfrac{m(m-1)}{2} - p - pm = \dfrac{(p - m)^2 - p - m}{2}.$

It can be expected that if $C \le 0$, then an algebraic solution to the equations is possible.
If $C > 0$, one can expect that no solution is possible; in this case it appears that $\Psi$
must satisfy some $C$ conditions for a solution to be possible. The inequality $C \le 0$ can
also be written

(4.5)  $m \ge \dfrac{2p + 1 - \sqrt{8p + 1}}{2}.$

Some values of $p$ and $[2p + 1 - \sqrt{8p + 1}]/2$ are

     p     [2p + 1 - sqrt(8p + 1)]/2
     1      0
     3      1
     5      2.3
     6      3
     8      4.5
     9      5.2
    10      6
    12      7.6
    13      8.4
    14      9.2
    15     10
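
The counting argument is easy to reproduce. The following is a small sketch, with hypothetical function names, that evaluates $C$ of (4.4) and the bound in (4.5), and also reports the smallest integer $m$ for which $C \le 0$.

```python
import math

def excess_conditions(p, m):
    """C of (4.4): equations plus rotation conditions minus unknowns."""
    return p * (p + 1) // 2 + m * (m - 1) // 2 - p - p * m

def lower_bound_m(p):
    """Right-hand side of (4.5): the smallest real m with C <= 0."""
    return (2 * p + 1 - math.sqrt(8 * p + 1)) / 2

def min_factors(p):
    """Smallest integer m with C <= 0 (m = p always qualifies)."""
    return next(m for m in range(p + 1) if excess_conditions(p, m) <= 0)

# Columns: p, the bound rounded to one decimal, smallest admissible integer m.
for p in (1, 3, 5, 6, 8, 9, 10, 12, 13, 14, 15):
    print(p, round(lower_bound_m(p), 1), min_factors(p))
```

The rounded bounds agree with the tabulated values; the third column is simply the bound rounded up to an integer.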
This counting of equations and unknowns gives us a rough criterion of solvability; it does
not, of course, lead to precise necessary and sufficient conditions for solvability. For one
thing we cannot be sure that the equations are independent; another difficulty is that the
solution may not be real or yield nonnegative $\sigma_{ii}$.

It is well known that a necessary and sufficient condition that a $p \times p$ matrix $A$ can
be expressed as $BB'$, where $B$ is $p \times m$, is that $A$ be positive semidefinite of rank
$m$. Thus we can state the following.

Theorem 4.1. A necessary and sufficient condition that $\Psi$ be a covariance matrix of a
factor analysis model with $m$ factors is that there exist a diagonal matrix $\Sigma^*$ with
nonnegative elements such that $\Psi - \Sigma^*$ is positive semidefinite of rank $m$.

Now the question is how we can tell whether there exists such a diagonal matrix $\Sigma^*$. It
is instructive to consider the case $m = 1$. Then we can expect that $\Psi$ has to satisfy
$C = p(p - 1)/2 - p$ conditions of equality as well as some inequalities. In this case
$\Lambda$ is a column vector and $\Lambda\Lambda'$ is a positive semidefinite matrix of rank
one. The question is whether we can subtract nonnegative numbers from the diagonal elements of
$\Psi$ to give a positive semidefinite matrix of rank one. $\Psi - \Sigma$ will be of rank one
if and only if $\Sigma$ can be chosen so that all second-order minors are zero. A second-order
minor which does not include a diagonal element is known as a tetrad and has the form

(4.6)  $\begin{vmatrix} \psi_{hi} & \psi_{hj} \\ \psi_{ki} & \psi_{kj} \end{vmatrix} = \psi_{hi}\psi_{kj} - \psi_{hj}\psi_{ki}$  ($h$, $i$, $j$, $k$ different).

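These must all be zero. A second-order minor which includes one diagonal element has the form

(4.7)  $\begin{vmatrix} \psi_{ii} - \sigma_{ii} & \psi_{ij} \\ \psi_{ki} & \psi_{kj} \end{vmatrix} = (\psi_{ii} - \sigma_{ii})\psi_{kj} - \psi_{ij}\psi_{ki}$  ($i$, $j$, $k$ different).

Setting this equal to zero shows that $\sigma_{ii}$ must be chosen so

(4.8)  $\sigma_{ii} = \psi_{ii} - \dfrac{\psi_{ij}\psi_{ki}}{\psi_{kj}}$  ($\psi_{kj} \ne 0$).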

The conditions that the solution be consistent (that is, independent of the pair $j$, $k$) are
the tetrad conditions. Moreover, these conditions insure that second-order minors containing
two diagonal elements are zero. It can be shown that $p(p - 1)/2 - p$ of the tetrad conditions
imply $\psi_{ij} = \lambda_i\lambda_j$ ($i \ne j$), and this in turn implies the tetrad
conditions for all $i$, $j$, $k$, $h$ (all different).

If the tetrad conditions are satisfied, then $\Psi - \Sigma$ will have rank one. If this matrix
is to be positive semidefinite, the diagonal elements must be nonnegative; that is,
$\psi_{ki}\psi_{ij}/\psi_{kj} \ge 0$. If $\Sigma$ is to be positive semidefinite,
$\sigma_{ii} \ge 0$.

Theorem 4.2. A necessary and sufficient condition that $\Psi$ be a covariance matrix of a
factor analysis model with one factor is that $p(p - 1)/2 - p$ independent tetrad conditions
are satisfied and

(4.9)  $0 \le \dfrac{\psi_{ki}\psi_{ij}}{\psi_{kj}} \le \psi_{ii}$

for one pair ($j \ne k$) for each $i$.

Another way of expressing the condition $\psi_{ki}\psi_{ij}/\psi_{kj} \ge 0$ is to ask whether
one can multiply some rows and corresponding columns by $-1$ to obtain a matrix with all
nonnegative elements.
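
The one-factor conditions can be checked on a small constructed example. The sketch below builds a population covariance matrix from assumed loadings (illustrative values, not from the paper), confirms that every tetrad (4.6) vanishes, and recovers the unique variances through (4.8) from every admissible pair $j$, $k$.

```python
from fractions import Fraction as F
from itertools import permutations

# Illustrative one-factor population (assumed values, not from the paper).
lam = [F(9, 10), F(8, 10), F(7, 10), F(6, 10)]
sigma = [1 - x * x for x in lam]
p = len(lam)
psi = [[lam[i] * lam[j] + (sigma[i] if i == j else 0) for j in range(p)]
       for i in range(p)]

def tetrad(h, i, j, k):
    """Second-order minor (4.6) built from rows h, k and columns i, j."""
    return psi[h][i] * psi[k][j] - psi[h][j] * psi[k][i]

def sigma_from(i, j, k):
    """Unique variance sigma_ii obtained from (4.8) with the pair j, k."""
    return psi[i][i] - psi[i][j] * psi[k][i] / psi[k][j]

tetrads_vanish = all(tetrad(*idx) == 0 for idx in permutations(range(p), 4))

# Every admissible pair (j, k) must give the same sigma_ii.
recovered = []
for i in range(p):
    values = {sigma_from(i, j, k)
              for j in range(p) for k in range(p)
              if len({i, j, k}) == 3}
    recovered.append(values)

consistent = all(len(v) == 1 for v in recovered)
# psi_ii - sigma_ii is the communality; it must lie between 0 and psi_ii.
bounds_hold = all(0 <= psi[i][i] - next(iter(v)) <= psi[i][i]
                  for i, v in enumerate(recovered))
```

With sample covariances the tetrads would vanish only approximately, which is why the question of fit is treated as a statistical problem in its own right.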
The case of one factor is of particular interest. In fact, the original theory of Spearman
was given for one “general” factor.

A  similar  analysis  can  be  made  for  the  case  m  =  2.  However,  the  conditions  become 
more  complicated  (see  [26]). 

In section 8 we shall consider the question of determining from the sample whether a factor
analysis model with a given number of factors is adequate to “explain” the situation. The
study of the problem of solvability in the population is of importance for the insight it
gives into the model and for suggestions of how to use the sample to ascertain whether the
model is suitable.

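Albert [1] has given a theorem that leads to a direct procedure for determining whether
$\Psi - \Sigma$ is of rank $m$. (The procedure does not verify whether $\Psi - \Sigma$ is
positive semidefinite.) Suppose that $m$ is the maximum rank of the submatrices of $\Psi$ that
do not include diagonal elements. Then the rows and columns of $\Psi$ can be numbered so

(4.10)  $\Psi = \begin{pmatrix} \Psi_{11} & \Psi_{12} & \Psi_{13} \\ \Psi_{21} & \Psi_{22} & \Psi_{23} \\ \Psi_{31} & \Psi_{32} & \Psi_{33} \end{pmatrix},$

where $\Psi_{11}$, $\Psi_{12} = \Psi_{21}'$, and $\Psi_{22}$ are square submatrices of order
$m$ and $\Psi_{12}$ is nonsingular. Then $\Psi - \Sigma$ is of rank $m$ if

(4.11)  $\Psi_{12} = (\Psi_{11} - \Sigma_1)\Psi_{21}^{-1}(\Psi_{22} - \Sigma_2)$,  $\Psi_{13} = (\Psi_{11} - \Sigma_1)\Psi_{21}^{-1}\Psi_{23}$,
        $\Psi_{32} = \Psi_{31}\Psi_{21}^{-1}(\Psi_{22} - \Sigma_2)$,  $\Psi_{33} - \Sigma_3 = \Psi_{31}\Psi_{21}^{-1}\Psi_{23}$.

Albert [2] has further shown that if $\Psi_{31}$ and $\Psi_{32}$ are also of rank $m$, then
there is a uniquely determined $\Sigma$ such that $\Psi - \Sigma$ is of rank $m$.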

5.  Problems  of  the  population:  Identification  (II)

Here we assume that there is at least one solution to $\Psi = \Sigma + \Lambda\Lambda'$, and
we ask whether there is more than one solution. More precisely, we assume that there is at
least one solution satisfying some conditions and ask whether there is more than one solution
satisfying these conditions. Since any solution $\Sigma$, $\Lambda$ can be replaced by
$\Sigma$, $\Lambda\Theta$, where $\Theta$ is orthogonal, it is clear that if we are to have a
unique solution, some additional conditions must be put on $\Lambda$ and $\Sigma$.

We can distinguish between two kinds of sets of restrictions. A set of one kind will not
affect $\Lambda\Lambda'$, while a set of the other kind may limit $\Lambda\Lambda'$. A set of
restrictions of the first kind is essentially a mathematical convenience, for any solution
$\Sigma$, $\Lambda$ gives a whole class of solutions $\Sigma$, $\Lambda\Theta$ and a set of
restrictions of the first kind simply picks out of $\Lambda\Theta$ a representative solution.
It is fairly clear how we can go from the class of solutions to the representative one and how
we can generate the class from the representative solution.

In section 4 we noted that there are $p(p + 1)/2$ elements of $\Psi$, $p$ elements of
$\Sigma$, $pm$ elements of $\Lambda$ and $m(m - 1)/2$ independent elements of $\Theta$. We can
expect that $m(m - 1)/2$ restrictions will be needed to eliminate the indeterminacy due to
$\Theta$. If $C = \tfrac{1}{2}[(p - m)^2 - p - m]$ is nonnegative we can then expect
identification. If $C$ is negative, we can expect that $-C$ additional restrictions are
necessary for identification; in this case there should be in all
$-C + m(m - 1)/2 = p + pm - p(p + 1)/2$ restrictions.

This counting of equations is, of course, inadequate for making precise statements about
identification. We shall now investigate the problem more adequately. It is possible to
consider conditions on $\Psi$ that imply identification (that is, unique solvability) just as
in the previous section we considered conditions on $\Psi$ for solvability. However, it is
more convenient to consider conditions involving $\Sigma$ and $\Lambda$ for the one assumed
solution. We shall first consider conditions assuming that $\Sigma$ and $\Lambda\Lambda'$ are
determined uniquely.

Lemma 5.1. If

(5.1)  $LL' = \Lambda\Lambda',$

where $\Lambda$ and $L$ are $p \times m$ and $\Lambda$ is of rank $m$, then
$L = \Lambda\Theta$, where $\Theta$ is orthogonal.

Proof. The lemma is well known, but we give a proof for the sake of completeness; methods for
finding $L$ subject to certain restrictions are given in section 6. Since $\Lambda$ is of rank
$m$, $\Lambda\Lambda'$ is of rank $m$ and $L$ must be of rank $m$. Multiply (5.1) on the right
by $L(L'L)^{-1}$ to obtain $L = \Lambda B$, where $B = \Lambda'L(L'L)^{-1}$. Multiplication on
the left by $(L'L)^{-1}L'$ shows $I = B'B$. Q.E.D.
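
The construction in the proof can be followed numerically. The sketch below uses assumed values for $\Lambda$ and an exact rational rotation (both illustrative, with hypothetical helper names), forms $L$ so that $LL' = \Lambda\Lambda'$, and then computes $B = \Lambda'L(L'L)^{-1}$ as in the proof.

```python
from fractions import Fraction as F

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inv2(a):
    """Inverse of a nonsingular 2 x 2 matrix."""
    (p, q), (r, s) = a
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

# Illustrative values (assumptions): Lambda is 3 x 2 of rank 2, Theta a rotation.
Lam = [[F(1), F(0)], [F(1), F(1)], [F(2), F(-1)]]
Theta = [[F(3, 5), F(-4, 5)], [F(4, 5), F(3, 5)]]
L = matmul(Lam, Theta)                            # so that L L' = Lambda Lambda'

LtL_inv = inv2(matmul(transpose(L), L))
B = matmul(matmul(transpose(Lam), L), LtL_inv)    # B = Lambda' L (L'L)^{-1}

identity = [[F(1), F(0)], [F(0), F(1)]]
b_is_orthogonal = matmul(transpose(B), B) == identity
l_equals_lam_b = matmul(Lam, B) == L
```

In this example $B$ reproduces the rotation that was used to build $L$, which is what the lemma asserts in general.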

Theorem 5.1. A sufficient condition for identification of $\Sigma$ and $\Lambda$ up to
multiplication on the right by an orthogonal matrix is that if any row of $\Lambda$ is deleted
there remain two disjoint submatrices of rank $m$.

Proof. Let $\Psi = \Sigma + \Lambda\Lambda'$. To prove the theorem we shall now show that if
$\Psi = S + LL'$, where $S$ is diagonal and $L$ is $p \times m$, then $S = \Sigma$ and
$LL' = \Lambda\Lambda'$. Since the off-diagonal elements of $\Lambda\Lambda'$ and of $LL'$ are
the corresponding off-diagonal elements of $\Psi$, we only have to show that the diagonal
elements of $LL'$ are equal to the diagonal elements of $\Lambda\Lambda'$.

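The condition implies that $2m + 1 \le p$. Let

(5.2)  $\Lambda = \begin{pmatrix} \Lambda_1 \\ \lambda_{m+1} \\ \Lambda_2 \\ \Lambda_3 \end{pmatrix}, \qquad L = \begin{pmatrix} L_1 \\ l_{m+1} \\ L_2 \\ L_3 \end{pmatrix},$

where $\Lambda_1$ and $\Lambda_2$ are nonsingular, and $\lambda_{m+1}$ is the $(m + 1)$st row;
$L$ is partitioned in submatrices of the same number of rows. Then

(5.3)  $\Lambda\Lambda' = \begin{pmatrix} \Lambda_1\Lambda_1' & \Lambda_1\lambda_{m+1}' & \Lambda_1\Lambda_2' & \Lambda_1\Lambda_3' \\ \lambda_{m+1}\Lambda_1' & \lambda_{m+1}\lambda_{m+1}' & \lambda_{m+1}\Lambda_2' & \lambda_{m+1}\Lambda_3' \\ \Lambda_2\Lambda_1' & \Lambda_2\lambda_{m+1}' & \Lambda_2\Lambda_2' & \Lambda_2\Lambda_3' \\ \Lambda_3\Lambda_1' & \Lambda_3\lambda_{m+1}' & \Lambda_3\Lambda_2' & \Lambda_3\Lambda_3' \end{pmatrix},$

and $LL'$ has the same form. Since $\Lambda_1\lambda_{m+1}'$, $\lambda_{m+1}\Lambda_2'$ and
$\Lambda_1\Lambda_2'$ are off-diagonal, $\Lambda_1\lambda_{m+1}' = L_1 l_{m+1}'$,
$\lambda_{m+1}\Lambda_2' = l_{m+1}L_2'$, and $\Lambda_1\Lambda_2' = L_1L_2'$, which is
nonsingular (since $\Lambda_1$ and $\Lambda_2$ are nonsingular). Since $LL'$ is of rank $m$,

(5.4)  $0 = \begin{vmatrix} L_1 l_{m+1}' & L_1L_2' \\ l_{m+1}l_{m+1}' & l_{m+1}L_2' \end{vmatrix} = \begin{vmatrix} \Lambda_1\lambda_{m+1}' & \Lambda_1\Lambda_2' \\ l_{m+1}l_{m+1}' & \lambda_{m+1}\Lambda_2' \end{vmatrix} = (-1)^m\, l_{m+1}l_{m+1}'\, |\Lambda_1\Lambda_2'| + f(\Lambda).$

Similarly, $0 = (-1)^m \lambda_{m+1}\lambda_{m+1}' |\Lambda_1\Lambda_2'| + f(\Lambda)$. Since
$|\Lambda_1\Lambda_2'| \ne 0$, $l_{m+1}l_{m+1}' = \lambda_{m+1}\lambda_{m+1}'$. In the same
fashion, we show that the other diagonal elements of $LL'$ are equal to those of
$\Lambda\Lambda'$. This proof is patterned after Albert [1].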

We can give a geometric interpretation of this condition. The columns of $\Lambda$ are vectors
in $p$-space; the columns of $\Lambda$ after a row is deleted are the projections of the
vectors on the space of $p - 1$ coordinate axes. We require that the projection of these
vectors on two different spaces of $m$ coordinate axes span the two spaces.

It is fairly clear that the condition is unnecessarily strong in general. After one
communality, that is, diagonal element of $LL'$, is determined, it can be used in determining
another. Moreover, the condition that $2m + 1 \le p$ is much stronger than that $C \ge 0$.
Wilson and Worcester [27] have given an example of $p = 6$ and $m = 3$ where one and only one
solution exists.
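
As a worked check of the comparison just made, the two counting conditions can be evaluated for the dimensions quoted from Wilson and Worcester. This verifies only the arithmetic with the formula for $C$ from section 4; it does not reproduce their example.

```latex
C = \tfrac{1}{2}\bigl[(p - m)^2 - p - m\bigr]
  = \tfrac{1}{2}\bigl[(6 - 3)^2 - 6 - 3\bigr] = 0 \ge 0,
\qquad
2m + 1 = 7 > 6 = p .
```

So the counting criterion is met with equality, while the sufficient condition of theorem 5.1 cannot hold for these dimensions.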

We now give some theorems that include necessary conditions for identification. It will be
assumed now that $\Sigma$ is positive definite.

Theorem 5.2. Let $C_m(\Lambda)$ be a condition on $\Lambda$ that is necessary for
identification. Then $C_m(\Lambda\Theta)$ for any orthogonal $\Theta$ is also a necessary
condition for identification.

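Proof. If $C_m(\Lambda)$ is not true, there is an $S$ and an $L$ such that

(5.5)  $\Lambda\Lambda' + \Sigma = LL' + S$

and $\Lambda\Lambda' \ne LL'$. If $C_m(\Lambda\Theta)$ is not true, then there is an $S^*$ and
$L^*$ such that

(5.6)  $(\Lambda\Theta)(\Lambda\Theta)' + \Sigma = L^*L^{*\prime} + S^*$

and $\Lambda\Theta(\Lambda\Theta)' \ne L^*L^{*\prime}$, but the equation implies
$\Lambda\Lambda' + \Sigma = L^*L^{*\prime} + S^*$ and $\Lambda\Lambda' \ne L^*L^{*\prime}$.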

Theorem 5.3. Let $C_m(\Lambda)$ be a condition on $\Lambda$ that is necessary for
identification. Let $\Lambda^*$ be a submatrix formed by taking $m^*$ columns of $\Lambda$.
Then $C_{m^*}(\Lambda^*)$ is a necessary condition for identification.

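Proof. Let the columns of $\Lambda$ be arranged so that $\Lambda = (\Lambda^* \; \Lambda^{**})$.
If $C_{m^*}(\Lambda^*)$ is not true, there is an $S$ and an $L^*$ such that

(5.7)  $\Lambda^*\Lambda^{*\prime} + \Sigma = L^*L^{*\prime} + S$

and $\Lambda^*\Lambda^{*\prime} \ne L^*L^{*\prime}$. Then (5.5) is satisfied for
$L = (L^* \; \Lambda^{**})$ and
$\Lambda\Lambda' = \Lambda^*\Lambda^{*\prime} + \Lambda^{**}\Lambda^{**\prime} \ne L^*L^{*\prime} + \Lambda^{**}\Lambda^{**\prime} = LL'$.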

Theorem 5.4. Let $C_{m,p}(\Lambda)$ be a condition on $\Lambda$ that is necessary for
identification. Let $\Lambda^*$ be the matrix derived from $\Lambda$ by deleting the rows that
have only zero elements. Then $C_{m,p^*}(\Lambda^*)$ is a necessary condition for
identification.

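Proof. Let the rows be numbered so

(5.8)  $\Lambda = \begin{pmatrix} \Lambda^* \\ 0 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma^* & 0 \\ 0 & \Sigma^{**} \end{pmatrix}.$

Then $\Psi = \Lambda\Lambda' + \Sigma$, with $\Psi$ partitioned conformably as
$\begin{pmatrix} \Psi^* & 0 \\ 0 & \Psi^{**} \end{pmatrix}$, becomes

(5.9)  $\Psi^* = \Lambda^*\Lambda^{*\prime} + \Sigma^*,$

(5.10)  $\Psi^{**} = \Sigma^{**},$

and only the first involves $\Lambda^*$ and $\Sigma^*$.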

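Lemma 5.2. If $p = 2$ and $m = 1$, $\Lambda\Lambda'$ and $\Sigma$ are not identified.

Proof. In this case

(5.11)  $\Psi = \begin{pmatrix} \psi_{11} & \psi_{12} \\ \psi_{21} & \psi_{22} \end{pmatrix} = \begin{pmatrix} \sigma_{11} + \lambda_{11}^2 & \lambda_{11}\lambda_{21} \\ \lambda_{21}\lambda_{11} & \sigma_{22} + \lambda_{21}^2 \end{pmatrix}.$

If one component of $\Lambda$ is 0, say $\lambda_{21}$, then $\psi_{12} = \psi_{21} = 0$ and
$\psi_{22} = \sigma_{22}$. Then we can take $s_{22} = \psi_{22}$, $l_{21} = 0$, and $s_{11}$
and $l_{11}$ as any numbers satisfying $s_{11} + l_{11}^2 = \psi_{11}$ ($s_{11} > 0$). If
$\lambda_{21} \ne 0$, then $\psi_{12} \ne 0$. Let $l_{21}$ be any number so that
$\psi_{12}^2/\psi_{11} < l_{21}^2 < \psi_{22}$. Then we take $s_{22} = \psi_{22} - l_{21}^2$,
$l_{11} = \psi_{12}/l_{21}$, and $s_{11} = \psi_{11} - l_{11}^2 = \psi_{11} - \psi_{12}^2/l_{21}^2$.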
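
The second case of the proof can be made concrete. The sketch below starts from an assumed structure with $p = 2$, $m = 1$ (illustrative values, hypothetical function names), picks an $l_{21}$ in the admissible interval, and constructs a different pair of unique variances and loadings that reproduces the same $\Psi$.

```python
from fractions import Fraction as F

def psi_from(s11, s22, l11, l21):
    """Covariance matrix (5.11) implied by unique variances and loadings."""
    return [[s11 + l11 * l11, l11 * l21],
            [l21 * l11, s22 + l21 * l21]]

# Illustrative "true" structure (assumed values): lambda_21 is nonzero.
sigma11, sigma22, lam11, lam21 = F(1), F(1), F(1), F(1)
psi = psi_from(sigma11, sigma22, lam11, lam21)

def alternative(psi, l21):
    """Second solution from the proof, for psi12^2/psi11 < l21^2 < psi22."""
    assert psi[0][1] ** 2 / psi[0][0] < l21 ** 2 < psi[1][1]
    l11 = psi[0][1] / l21
    s22 = psi[1][1] - l21 ** 2
    s11 = psi[0][0] - l11 ** 2
    return s11, s22, l11, l21

s11, s22, l11, l21 = alternative(psi, F(4, 5))
```

Every $l_{21}$ in the open interval gives another solution with positive unique variances, so there is a continuum of structures generating the same two-variable population.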

Theorem 5.5. A necessary and sufficient condition for identification if $m = 1$ is that at
least three factor loadings be nonzero.

Proof.  The  necessity  follows  from  lemma  5.2  and  theorem  5.4;  the  sufficiency  is  a 
special  case  of  theorem  5.1. 

Theorem 5.6. A necessary condition for identification is that each column of $\Lambda A$ has
at least three nonzero elements for every nonsingular $A$.

Proof. For $A = I$, the result follows from theorems 5.3 and 5.5. Then theorem 5.2 implies the
result for $A$ being orthogonal. If $A$ is not orthogonal, suppose $\Lambda A$ has less than
three nonzero elements in the $\nu$th column. Then the same will be true for an orthogonal
matrix with $\nu$th column proportional to the $\nu$th column of $A$.

Lemma 5.3. If $p = 4$ and $m = 2$, $\Lambda\Lambda'$ and $\Sigma$ are not identified.

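Proof. Let the rows of $\Lambda$ be numbered so there is a nonzero element in the first row
(theorem 5.6). We can multiply $\Lambda$ on the right by an orthogonal matrix so $\Lambda$ has
the form

(5.12)  $\Lambda = \begin{pmatrix} \lambda_{11} & 0 \\ \lambda_1^* & \lambda_2^* \end{pmatrix},$

where $\lambda_{11} \ne 0$. All components of $\lambda_2^*$ are nonzero by theorem 5.6. We
shall now find $L$ of the form of $\Lambda$ so

(5.13)  $\begin{pmatrix} \sigma_{11} + \lambda_{11}^2 & \lambda_{11}\lambda_1^{*\prime} \\ \lambda_{11}\lambda_1^* & \Sigma^* + \lambda_1^*\lambda_1^{*\prime} + \lambda_2^*\lambda_2^{*\prime} \end{pmatrix} = \begin{pmatrix} s_{11} + l_{11}^2 & l_{11}l_1^{*\prime} \\ l_{11}l_1^* & S^* + l_1^*l_1^{*\prime} + l_2^*l_2^{*\prime} \end{pmatrix}.$

Let $l_{11} = k\lambda_{11}$, where $k > 1$ and
$s_{11} = \sigma_{11} + \lambda_{11}^2 - l_{11}^2 = \sigma_{11} + (1 - k^2)\lambda_{11}^2 > 0$.
Let $l_1^* = (1/k)\lambda_1^*$. Then

(5.14)  $S^* + l_2^*l_2^{*\prime} = \Sigma^* + \lambda_1^*\lambda_1^{*\prime} + \lambda_2^*\lambda_2^{*\prime} - l_1^*l_1^{*\prime}$
        $= \Sigma^* + \left(1 - \dfrac{1}{k^2}\right)\lambda_1^*\lambda_1^{*\prime} + \lambda_2^*\lambda_2^{*\prime}$

is positive definite. If $1 - 1/k^2$ is taken small enough, the nondiagonal elements of the
right-hand side of (5.14) have the same signs as the corresponding elements of
$\lambda_2^*\lambda_2^{*\prime}$. By theorem 4.2 there is a solution of (5.14) for $S^*$ and
$l_2^*$.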

Theorem 5.7. A necessary and sufficient condition for identification if $m = 2$ is that if any
row of $\Lambda$ is deleted, the remaining rows of $\Lambda$ can be arranged to form two
disjoint matrices of rank 2.

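Proof. The sufficiency is a special case of theorem 5.1. To prove the necessity suppose that
if we delete the first row of $\Lambda$ there are not two remaining disjoint matrices of rank
2. Let the rows of $\Lambda$ be arranged so $\Lambda$ can be partitioned as

(5.15)  $\Lambda = \begin{pmatrix} \lambda_1 \\ \Lambda_2 \\ \Lambda_3 \end{pmatrix},$

where $\Lambda_2$ is $2 \times 2$ and of rank at most 2, and $\Lambda_3$ is of rank 1. Since
$\Lambda_3$ is of rank 1, there is an orthogonal matrix $\Theta$ such that
$\Lambda_3\Theta = (v \;\; 0)$, where $v$ and $0$ are vectors of $p - 3$ components. Let

(5.16)  $\Lambda^* = \Lambda\Theta = \begin{pmatrix} \lambda_{11}^* & \lambda_{12}^* \\ \lambda_{21}^* & \lambda_{22}^* \\ \lambda_{31}^* & \lambda_{32}^* \\ v & 0 \end{pmatrix}.$

By theorem 5.6, $\lambda_{22}^* \ne 0 \ne \lambda_{32}^*$. After deleting the first row of
$\Lambda^*$, we can get two submatrices of rank 2 only in the form

(5.17)  $\begin{pmatrix} \lambda_{21}^* & \lambda_{22}^* \\ v_i & 0 \end{pmatrix}, \qquad \begin{pmatrix} \lambda_{31}^* & \lambda_{32}^* \\ v_j & 0 \end{pmatrix} \qquad (i \ne j).$

The assumption that there are not two such matrices of rank 2 implies that $v_i = 0$ except
for at most one index $i$. Then theorem 5.4 and lemma 5.3 imply $\Lambda^*$ (and $\Lambda$) is
not identified.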

Theorem 5.8. A necessary condition for identification is that for each pair of columns
of $\Lambda A$ and for every nonsingular $A$, when a row is deleted, the remaining rows of this
two-column matrix can be arranged to form two disjoint submatrices of rank 2.

Proof.  This  follows  from  theorems  5.2,  5.3,  and  5.7. 

Now let us consider restrictions that eliminate the indeterminacy of rotation. We
might note in passing that we consider $\Lambda$ and $\Lambda^*$ as equivalent if each column of $\Lambda^*$ is
obtained by multiplying the column of $\Lambda$ by $\pm 1$, for replacing a column of $\Lambda$ by its
negative is only equivalent to replacing a factor score by its negative. Each of the following
sets of restrictions is convenient for a particular method of solving $C = \Lambda\Lambda'$ for $\Lambda$
(section 6) and for a method of estimation.

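(a) Triangular matrix of 0's. This condition is that

$$\Lambda = \begin{pmatrix}
\lambda_{11} & 0 & 0 & \cdots & 0 \\
\lambda_{21} & \lambda_{22} & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\lambda_{m1} & \lambda_{m2} & \lambda_{m3} & \cdots & \lambda_{mm} \\
\vdots & \vdots & \vdots & & \vdots \\
\lambda_{p1} & \lambda_{p2} & \lambda_{p3} & \cdots & \lambda_{pm}
\end{pmatrix}, \tag{5.18}$$

that is, that the upper square matrix is triangular. If we think of a row of $\Lambda$ as a vector
in $m$-space, the condition is that the first row coincide with the first coordinate axis, the
second row lie in the plane determined by the first two coordinate axes, etc.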

(b) General triangularity condition. Let $B$ be a given $p \times m$ matrix (of rank $m$). Here
we require that

$$B'\Lambda = \begin{pmatrix}
x & 0 & \cdots & 0 \\
x & x & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
x & x & \cdots & x
\end{pmatrix}, \tag{5.19}$$

where $x$ indicates an element not specified zero. It is seen that if $B' = (I \;\; 0)$, then we
obtain condition (a).

(c) Diagonality of $\Lambda'\Lambda$. Here we require that $\Lambda'\Lambda$ be diagonal and that the diagonal
elements of $\Lambda'\Lambda$ be different and arranged in descending order. Given a positive definite
matrix $A$, there is a uniquely determined orthogonal matrix $\Theta$ (except for multiplication
of columns by $-1$) such that $\Theta' A \Theta$ is diagonal with diagonal elements arranged in
descending order, assuming that the diagonal elements (which are the characteristic roots
of $A$) are different. If $A$ is already in this diagonal form, $\Theta = I$.

(d) Diagonality of $\Lambda'\Sigma^{-1}\Lambda$. Here we require that $\Lambda'\Sigma^{-1}\Lambda$ be diagonal and that the
diagonal elements be different and arranged in descending order. Rao [17] has related this
condition to canonical correlation analysis.

The conditions given above are more or less arbitrary ways of determining the factor
loadings uniquely. They do not correspond to any theoretical considerations of psychology;
there is no inherent meaning in them. We shall now consider two types of restrictions
on $\Lambda$ which may have intrinsic meaning; these conditions may also restrict $\Lambda\Lambda'$.

Simple structure. These are conditions proposed by Thurstone for choosing a matrix
out of the class $\Lambda\Theta$ that will have particular psychological meaning. If $\lambda_{i\alpha} = 0$, then the
$\alpha$th factor does not enter into the $i$th test. The general idea of “simple structure” is that
many tests should not depend on all the factors when the factors have real psychological
meaning. This suggests that given a $\Lambda$ one should consider all rotations, that is, all
matrices $\Lambda\Theta$, where $\Theta$ is orthogonal, and choose the one giving most 0 coefficients. This matrix
can be considered as giving the simplest structure and presumably the one with most
meaningful psychological interpretation. It should be remembered that the psychologist
can construct his tests so that they depend on the factors in different ways.

If we do not require $\mathcal{E} ff' = I$, then

$$\mathcal{E}(X - \mu)(X - \mu)' = \Sigma + \Lambda M \Lambda', \tag{5.20}$$

where $M = \mathcal{E} ff'$. Then $\Lambda^* = \Lambda A$ and $M^* = A^{-1}M(A^{-1})'$ also satisfy (5.20). The
indeterminacy here is indicated by the nonsingular matrix $A$. Thurstone has suggested
simple structure as a means of identification in this case also. Of course, one needs to add
a normalization on each component of $f$ or on each column of $\Lambda$ (as well as an ordering
of the columns of $\Lambda$).
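
The indeterminacy in (5.20) can be checked numerically. The following is a minimal sketch, not part of the original development: the loadings, the factor covariance matrix, and the nonsingular transformation (called `T` in the code) are hypothetical values chosen only for illustration, and the helper functions are ad hoc.

```python
def matmul(X, Y):
    """Plain-list matrix product X @ Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def inverse_2x2(T):
    (a, b), (c, d) = T
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical values: 3 tests, 2 correlated factors.
Lam = [[1.0, 0.0], [0.5, 2.0], [1.5, -1.0]]
M = [[1.0, 0.3], [0.3, 1.0]]
T = [[2.0, 1.0], [0.5, 3.0]]      # any nonsingular 2 x 2 matrix

T_inv = inverse_2x2(T)
Lam_star = matmul(Lam, T)                               # Lambda* = Lambda T
M_star = matmul(matmul(T_inv, M), transpose(T_inv))     # M* = T^{-1} M (T^{-1})'

common = matmul(matmul(Lam, M), transpose(Lam))         # Lambda M Lambda'
common_star = matmul(matmul(Lam_star, M_star), transpose(Lam_star))
max_gap = max(abs(u - v) for r, s in zip(common, common_star) for u, v in zip(r, s))
print(max_gap)   # rounding error only
```

Both parameter pairs give the same common-factor part of the covariance matrix, so the observable covariance matrix alone cannot distinguish them.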

Thurstone (p. 335 of [24]) suggests that the matrix $\Lambda$ should be chosen so that there
is a submatrix of $\Lambda$ (obtained by deleting rows of $\Lambda$), say $\bar{\Lambda}$, with the following properties:
(1) Each row of $\bar{\Lambda}$ should have at least one zero element. (2) Each column of $\bar{\Lambda}$ should
have zero elements in at least $m$ rows and these rows should be linearly independent.
(It should be pointed out that the desired linear independence is impossible because
these rows have zero elements in a given column out of $m$ columns and hence the
submatrix of these rows can have maximum rank of $m - 1$.) (3) For every pair of columns
of $\bar{\Lambda}$ there should be several rows in which one coefficient is zero and one is nonzero.
(4) For every pair of columns in $\bar{\Lambda}$ a large proportion of rows should have two zero
coefficients (if $m \geq 4$). (5) For every pair of columns of $\bar{\Lambda}$ there should preferably be only a
small number of rows with two nonzero coefficients.

It is extremely difficult to study the adequacy of these conditions to effect identification.
Reiersøl [19] has investigated these conditions, modified a bit. He assumes that
there are at least $m$ zero elements in each column of $\Lambda$. Let $\Lambda^{(\alpha)}$ ($\alpha = 1, \cdots, m$) be the
submatrix of $\Lambda$ that has zero elements in the $\alpha$th column. Reiersøl further assumes that
(i) the rank of $\Lambda^{(\alpha)}$ is $m - 1$, (ii) the rank of each submatrix obtained by deleting a row
of $\Lambda^{(\alpha)}$ is $m - 1$, and (iii) the addition to $\Lambda^{(\alpha)}$ of any row of $\Lambda$ not contained in $\Lambda^{(\alpha)}$
increases the rank to $m$. Then if $\Lambda\Lambda'$ is identified, a necessary and sufficient condition for
the identification of $\Lambda$ is that $\Lambda$ does not contain any other submatrices satisfying (i),
(ii), and (iii).

Zero elements in specified positions. Here we consider a set of conditions that require
of the investigator more a priori information. He must know that some tests do not
depend on some factors. In this case the conditions are that $\lambda_{i\alpha} = 0$ for certain pairs
$(i, \alpha)$; that is, that the $\alpha$th factor does not affect the $i$th test score. In this case we do
not assume that $\mathcal{E} ff' = I$. These conditions are similar to some used in econometric
models. The coefficients of the $\alpha$th column are identified except for multiplication by a
scale factor if (A) there are at least $m - 1$ zero elements and (B) the rank of $\Lambda^{(\alpha)}$ is
$m - 1$ (see [13]).

It will be seen that there are $m$ normalizations and a minimum of $m(m - 1)$ zero
conditions. This is equal to the number of elements of $A$. If there are more than $m - 1$
zero elements specified in one or more columns of $\Lambda$, then there may be more conditions
than are required to take out the indeterminacy in $\Lambda A$; in this case the conditions may
restrict $\Lambda M \Lambda'$.

Local identification. We can ask the question, when we suppose there is a $\Sigma$ and a $\Lambda$
satisfying $\Psi = \Sigma + \Lambda\Lambda'$ and some other conditions such as $\Lambda'\Sigma^{-1}\Lambda$ being diagonal, is
there another pair of such matrices in the neighborhood of $\Sigma$, $\Lambda$? In other words, if we
change $\Sigma$ and $\Lambda$ by small amounts, does $\Sigma + \Lambda\Lambda'$ necessarily change? If $\Sigma + \Lambda\Lambda'$ does
change, then we say that $\Sigma$ and $\Lambda$ are locally identified. We can give a sufficient
condition for this.

Theorem 5.9. Let $\Phi = \Sigma - \Lambda(\Lambda'\Sigma^{-1}\Lambda)^{-1}\Lambda'$. If $|\phi_{ij}^2| \neq 0$, then $\Sigma$ and $\Lambda$ are locally
identified under the restriction that $\Lambda'\Sigma^{-1}\Lambda$ is diagonal and the diagonal elements are
different and arranged in descending order of size.

Proof. Let $\Psi = \Sigma + \Lambda\Lambda'$. Then any pair of matrices $\Sigma^*$, $\Lambda^*$ satisfying $\Psi^* =
\Sigma^* + \Lambda^*\Lambda^{*\prime}$ with $\Lambda^{*\prime}\Sigma^{*-1}\Lambda^*$ diagonal must also satisfy

$$\Lambda^*(I + \Gamma^*) = \Psi^*\Sigma^{*-1}\Lambda^*, \tag{5.21}$$

$$\operatorname{diag} \Sigma^* = \operatorname{diag}(\Psi^* - \Lambda^*\Lambda^{*\prime}), \tag{5.22}$$

$$\Lambda^{*\prime}\Sigma^{*-1}\Lambda^* = \Gamma^* \tag{5.23}$$

and the condition that $\Gamma^*$ is diagonal. As will be seen later, the above equations are
analogous to a set of equations defining some estimates. These equations define $\Lambda^*$ and $\Sigma^*$
implicitly. We shall show that from these equations one can find the set of partial
derivatives $\partial\sigma_{ii}^*/\partial\psi_{jk}$, $\partial\lambda_{i\alpha}^*/\partial\psi_{jk}$. Under the conditions of the theorem the matrix of
partial derivatives is of maximum rank (equal to the number of elements in $\Sigma$, $\Lambda$); this
is proved in section 12. The Taylor's series expansion for $\Sigma^*$ and $\Lambda^*$ in terms of $\Psi^*$ is

$$(\Sigma^* - \Sigma,\; \Lambda^* - \Lambda) = L(\Psi^* - \Psi), \tag{5.24}$$

where $L$ is a linear function. The right-hand side is zero if and only if the left-hand side
is zero. Q.E.D.

In a sense the study of identifiability is of more relevance than the study of solvability,
for identification requires that the investigator specify some features of the model and
he wants to know how to do this. As far as solvability goes, in principle, he either has it
or he does not, and there is nothing for him to do about it.

6. Problems of the population: Determination of the structure (III)

The study of solvability and identification implies methods of solving for the structure,
given the population of the observables. If the conditions of theorem 5.1 are satisfied,
then the communalities can be determined as indicated in the proof of that theorem;
this determines $\Lambda\Lambda' = C$, say. Let $\Lambda = (\lambda^{(1)} \lambda^{(2)} \cdots \lambda^{(m)})$, where $\lambda^{(\alpha)}$ is the $\alpha$th column
of $\Lambda$. Then

$$C = \lambda^{(1)}\lambda^{(1)\prime} + \lambda^{(2)}\lambda^{(2)\prime} + \cdots + \lambda^{(m)}\lambda^{(m)\prime}. \tag{6.1}$$

In many cases one determines the $\lambda^{(\alpha)}$'s successively. After $\lambda^{(1)}$ is found, we define $C^{(1)} =
C - \lambda^{(1)}\lambda^{(1)\prime} = \lambda^{(2)}\lambda^{(2)\prime} + \cdots + \lambda^{(m)}\lambda^{(m)\prime}$, and proceed to find $\lambda^{(2)}$. In turn we define
$C^{(\alpha)} = C^{(\alpha-1)} - \lambda^{(\alpha)}\lambda^{(\alpha)\prime}$ and find $\lambda^{(\alpha+1)}$. The methods depend on the identification
conditions.

(a) Triangularity conditions. Since the first components of $\lambda^{(2)}, \cdots, \lambda^{(m)}$ are zero,
the first column of $C$ is $\lambda_{11}\lambda^{(1)}$; $\lambda_{11}$ is determined from $c_{11} = \lambda_{11}^2$ and the rest of $\lambda^{(1)}$ is
found from the first column of $C$. The matrix $C^{(1)} = C - \lambda^{(1)}\lambda^{(1)\prime}$ has only 0's in the
first row and column; since the first two components of $\lambda^{(3)}, \cdots, \lambda^{(m)}$ are zero, the
second column of $C^{(1)}$ is $\lambda_{22}\lambda^{(2)}$; this determines $\lambda^{(2)}$. In turn $\lambda^{(3)}, \cdots, \lambda^{(m)}$ are found
similarly.
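
The successive determination under the triangularity condition can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and the test loadings are hypothetical, the positive square root is taken for each $\lambda_{\alpha\alpha}$ (the sign of a column being arbitrary), and each pivot $c_{\alpha\alpha}$ is assumed to be positive.

```python
import math

def triangular_loadings(C, m):
    """Recover a p x m loading matrix with zeros above the diagonal from C = L L'."""
    p = len(C)
    C = [row[:] for row in C]            # working copy; holds C^(a) after step a
    L = [[0.0] * m for _ in range(p)]
    for a in range(m):
        pivot = math.sqrt(C[a][a])        # lambda_aa from c_aa = lambda_aa^2
        for i in range(p):
            L[i][a] = C[i][a] / pivot     # column a of C^(a-1) equals lambda_aa * lambda^(a)
        for i in range(p):                # deflate: C^(a) = C^(a-1) - lambda^(a) lambda^(a)'
            for j in range(p):
                C[i][j] -= L[i][a] * L[j][a]
    return L

# Hypothetical loadings satisfying the triangular condition (p = 4, m = 2).
true_L = [[2.0, 0.0], [1.0, 3.0], [0.5, 1.0], [1.5, -2.0]]
C = [[sum(x * y for x, y in zip(r, s)) for s in true_L] for r in true_L]
est_L = triangular_loadings(C, 2)
print(est_L)
```

Applied to $C$ built from loadings that already satisfy the condition, the procedure returns those loadings, up to rounding.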

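(b) General triangularity conditions. Let

$$B'\Lambda = F = \begin{pmatrix}
f_{11} & 0 & \cdots & 0 \\
f_{21} & f_{22} & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
f_{m1} & f_{m2} & \cdots & f_{mm}
\end{pmatrix} = (f^{(1)} \cdots f^{(m)}), \tag{6.2}$$

$$B = (b^{(1)} \cdots b^{(m)}). \tag{6.3}$$

Then

$$CB = \Lambda F' = \lambda^{(1)}f^{(1)\prime} + \cdots + \lambda^{(m)}f^{(m)\prime}$$

and

$$B'CB = FF' = f^{(1)}f^{(1)\prime} + \cdots + f^{(m)}f^{(m)\prime}.$$

These two matrix equations can be written

$$Cb^{(\alpha)} = \lambda^{(1)}f_{\alpha 1} + \cdots + \lambda^{(\alpha)}f_{\alpha\alpha}, \qquad \alpha = 1, \cdots, m, \tag{6.4}$$

$$b^{(\beta)\prime}Cb^{(\alpha)} = f_{\beta 1}f_{\alpha 1} + \cdots + f_{\beta\beta}f_{\alpha\beta}, \qquad \beta \leq \alpha = 1, \cdots, m. \tag{6.5}$$

The first column of $CB$ is $Cb^{(1)} = \lambda^{(1)}f_{11}$, and the first element of $B'CB$ is $b^{(1)\prime}Cb^{(1)} = f_{11}^2$;
we determine $f_{11}$ and $\lambda^{(1)}$ from these, which only involve $b^{(1)}$. The second column of $CB$
is $Cb^{(2)} = \lambda^{(1)}f_{21} + \lambda^{(2)}f_{22}$ and two more elements of $B'CB$ are $b^{(1)\prime}Cb^{(2)} = f_{11}f_{21}$ and
$b^{(2)\prime}Cb^{(2)} = f_{21}^2 + f_{22}^2$; we find $f_{21}$, $f_{22}$, and $\lambda^{(2)}$ from these, which involve only the first
two columns of $B$. In turn we find each column of $\Lambda$; the $\alpha$th column only requires use of
the first $\alpha$ columns of $B$.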

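There is an alternative method for finding $\lambda^{(2)}$ after $\lambda^{(1)}$ is found. Let $C^{(1)} =
C - \lambda^{(1)}\lambda^{(1)\prime} = \lambda^{(2)}\lambda^{(2)\prime} + \cdots + \lambda^{(m)}\lambda^{(m)\prime}$; then $C^{(1)}b^{(2)} = \lambda^{(2)}f_{22}$ and $b^{(2)\prime}C^{(1)}b^{(2)} = f_{22}^2$.
In turn we define $C^{(\alpha)} = C^{(\alpha-1)} - \lambda^{(\alpha)}\lambda^{(\alpha)\prime}$ and find $\lambda^{(\alpha+1)}$.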

(c) Diagonality of $\Lambda'\Lambda$. Let $d_1, \cdots, d_m$ be the nonzero roots of $|C - dI| = 0$,
ordered in descending order, and let $\lambda^{(\alpha)}$ be the corresponding vectors satisfying
$(C - d_\alpha I)\lambda^{(\alpha)} = 0$ and $\lambda^{(\alpha)\prime}\lambda^{(\alpha)} = d_\alpha$. (It follows that $\lambda^{(\alpha)\prime}\lambda^{(\beta)} = 0$, $\alpha \neq \beta$.) If $\Lambda =
(\lambda^{(1)} \cdots \lambda^{(m)})$ and $D = (d_\alpha\delta_{\alpha\beta})$, then the equations can be written

$$C\Lambda = \Lambda D, \tag{6.6}$$

$$\Lambda'\Lambda = D. \tag{6.7}$$

These equations (and the fact that $D$ is diagonal with ordered elements) determine
$\Lambda$ and $D$ uniquely. Since the $\Lambda$ of the model satisfies the equations ($C\Lambda = \Lambda\Lambda'\Lambda = \Lambda(\Lambda'\Lambda)$),
it is the unique solution.
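
A minimal numerical sketch of this determination follows. It is an illustration under stated assumptions, not a recommended procedure: the characteristic roots and vectors are obtained by a simple power iteration with deflation (a technique chosen here for self-containment), the function names and the example loadings are hypothetical, and the nonzero roots are assumed distinct.

```python
import math

def matvec(C, v):
    return [sum(c * x for c, x in zip(row, v)) for row in C]

def leading_root_vector(C, iters=500):
    """Largest characteristic root of C and a unit vector for it, by power iteration."""
    v = [1.0 + 0.1 * i for i in range(len(C))]    # arbitrary starting vector
    for _ in range(iters):
        w = matvec(C, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    d = sum(x * y for x, y in zip(v, matvec(C, v)))   # Rayleigh quotient
    return d, v

def loadings_diagonal_gram(C, m):
    """Columns lambda^(a) with C lambda = d lambda and lambda'lambda = d, as in (6.6)-(6.7)."""
    C = [row[:] for row in C]
    roots, cols = [], []
    for _ in range(m):
        d, v = leading_root_vector(C)
        lam = [math.sqrt(d) * x for x in v]       # scale so that lambda'lambda = d
        roots.append(d)
        cols.append(lam)
        for i in range(len(C)):                   # C^(a) = C^(a-1) - lambda lambda'
            for j in range(len(C)):
                C[i][j] -= lam[i] * lam[j]
    return roots, cols

# Hypothetical loadings with orthogonal columns (p = 4, m = 2).
l1, l2 = [2.0, 2.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]
C = [[l1[i] * l1[j] + l2[i] * l2[j] for j in range(4)] for i in range(4)]
roots, cols = loadings_diagonal_gram(C, 2)
print(roots)
```

The recovered columns agree with the original ones up to the sign of each column, which is the equivalence noted in section 5.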

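(d) Diagonality of $\Lambda'\Sigma^{-1}\Lambda$. Let $d_1, \cdots, d_m$ be the nonzero roots of $|C\Sigma^{-1} - dI| = 0$
(that is, of $|C - d\Sigma| = 0$) ordered in descending order, and let $\lambda^{(\alpha)}$ be the corresponding
vectors satisfying $(C\Sigma^{-1} - d_\alpha I)\lambda^{(\alpha)} = 0$ and $\lambda^{(\alpha)\prime}\Sigma^{-1}\lambda^{(\alpha)} = d_\alpha$. These equations can
be summarized as

$$C\Sigma^{-1}\Lambda = \Lambda D, \tag{6.8}$$

$$\Lambda'\Sigma^{-1}\Lambda = D. \tag{6.9}$$

These equations, regarded as equations in $\Lambda$ and $D$, have as their unique solution the $\Lambda$
of the model together with $D = \Lambda'\Sigma^{-1}\Lambda$.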

It will be seen later that there is a relation between a method of estimation and a
method of determining the structure from the population. However, several methods of
estimation can be derived without the motivation of finding an analogue to a method for
the population.

7.  Problems  of  statistical  inference:  Methods  of  estimation  (IV) 

7.1. Preliminary remarks. We now consider drawing a sample of $N$ observations on $X$,
where $X = \Lambda f + U + \mu$, where $f$ has the distribution $N(0, I)$ and $U$ has the distribution
$N(0, \Sigma)$; that is, $N$ observations from $N(\mu, \Sigma + \Lambda\Lambda')$. Let the observations be
$x_1, \cdots, x_N$. In all methods of estimation $\mu$ is estimated by

$$\bar{x} = \frac{1}{N}\sum_{\alpha=1}^{N} x_\alpha, \tag{7.1}$$

which is the maximum likelihood estimate of $\mu$. The estimation of $\Sigma$ and $\Lambda$ is based upon

$$A = \frac{1}{N}\sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})' = \frac{1}{N}\left[\sum_{\alpha=1}^{N} x_\alpha x_\alpha' - N\bar{x}\bar{x}'\right]. \tag{7.2}$$

As is well known, $\frac{N}{N-1}A$ is an unbiased estimate of the covariance matrix of $X$.

We shall now consider a number of estimation methods for $\Lambda$ and $\Sigma$. Later we shall
consider estimation methods when $f$ is not considered random, but $f_\alpha$ is a vector of
parameters for the $\alpha$th individual.


7.2. Maximum likelihood estimates for random factor scores when $\Lambda\Lambda'$ is unrestricted.
Maximum likelihood estimates were derived by Lawley [14] for the case of random factor
scores when the restriction on the parameters is that $\Lambda'\Sigma^{-1}\Lambda$ is diagonal (and the
diagonal elements are ordered in descending order of size). As was seen earlier this
restriction merely takes out the indeterminacy of the rotation in $\Lambda$. The logarithm of the
likelihood function for the sample is

$$\begin{aligned}
&-\tfrac{1}{2}pN\log(2\pi) - \tfrac{1}{2}N\log|\Sigma^* + \Lambda^*\Lambda^{*\prime}| - \tfrac{1}{2}\sum_{\alpha=1}^{N}(x_\alpha - \mu^*)'(\Sigma^* + \Lambda^*\Lambda^{*\prime})^{-1}(x_\alpha - \mu^*) \\
&\quad = -\tfrac{1}{2}pN\log(2\pi) - \tfrac{1}{2}N\log|\Sigma^* + \Lambda^*\Lambda^{*\prime}| - \tfrac{1}{2}\operatorname{tr}[NA(\Sigma^* + \Lambda^*\Lambda^{*\prime})^{-1}] \\
&\qquad - \tfrac{1}{2}N(\bar{x} - \mu^*)'(\Sigma^* + \Lambda^*\Lambda^{*\prime})^{-1}(\bar{x} - \mu^*),
\end{aligned} \tag{7.3}$$

where we write $\mu^*$, $\Sigma^*$, and $\Lambda^*$ to denote that these are mathematical variables. It will
be noticed that replacing $\Lambda^*$ by $\Lambda^*\Theta$, where $\Theta$ is orthogonal, does not change the
likelihood function. Thus if we find $\mu^*$, $\Sigma^*$, and $\Lambda^*$ to maximize the likelihood function, then
$\mu^*$, $\Sigma^*$, and $\Lambda^*\Theta$ will also maximize it. The restriction that $\Lambda^{*\prime}\Sigma^{*-1}\Lambda^*$ be diagonal is a
convenience here to make the maximizing variables unique (for almost all samples).

When $\mu^*$ is set equal to $\bar{x}$, the last term on the right of (7.3) vanishes. It is easy to
verify that (for almost all samples) the likelihood function is maximized when the
derivatives (subject to $\Lambda^{*\prime}\Sigma^{*-1}\Lambda^*$ being diagonal) are set equal to zero. The resulting
equations (after considerable algebraic manipulation) are

$$\hat{\Lambda}(I + \hat{\Gamma}) = A\hat{\Sigma}^{-1}\hat{\Lambda}, \tag{7.4}$$

$$\operatorname{diag}\hat{\Sigma} = \operatorname{diag}(A - \hat{\Lambda}\hat{\Lambda}'), \tag{7.5}$$

$$\hat{\Gamma} = \hat{\Lambda}'\hat{\Sigma}^{-1}\hat{\Lambda}, \tag{7.6}$$

$$\operatorname{nondiag}\hat{\Gamma} = \operatorname{nondiag} 0, \tag{7.7}$$

where $\operatorname{diag} B$ indicates the diagonal matrix formed from the diagonal elements of $B$ and
$\operatorname{nondiag} B = B - \operatorname{diag} B$. Equation (7.4) can also be written

$$\hat{\Lambda}\hat{\Gamma} = (A - \hat{\Sigma})\hat{\Sigma}^{-1}\hat{\Lambda}. \tag{7.8}$$

These equations may be compared with (6.8) and (6.9). It is seen that (7.8), (7.6), and
(7.7) are similar to equations defining the characteristic vectors and roots of $A$ in the
metric of $\hat{\Sigma}$. $A - \hat{\Sigma}$ is the sample analogue of $C$. It is assumed that the $m$ largest roots
are positive.

The above equations are practically impossible to solve algebraically. Lawley [14]
suggests an iterative procedure which involves approximating $\hat{\Sigma}$, then solving for $\hat{\Lambda}$,
then using this in (7.5) to get a new approximation for $\hat{\Sigma}$, etc. In this paper we shall not
discuss in detail computational procedures for any estimates; we hope to consider these
in a later paper.

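7.3. Maximum likelihood estimates for random factor scores when $\Lambda\Lambda'$ is unrestricted
and $\Sigma = \sigma^2 I$; principal components. The assumption that $\Sigma = \sigma^2 I$, that is, that $\Sigma$ is a
diagonal matrix with all diagonal elements equal, is not an assumption that would
ordinarily be suitable, but the assumption leads to an estimate of $\Lambda$ that is closely related to
other methods we discuss. Here

$$\Gamma = \Lambda'\Sigma^{-1}\Lambda = \frac{1}{\sigma^2}\Lambda'\Lambda. \tag{7.9}$$

The condition that $\Gamma$ is diagonal is equivalent to the condition that $\Lambda'\Lambda$ is diagonal. The
equations defining the maximum likelihood estimates are

$$\hat{\Lambda}(I + \hat{\Gamma}) = \frac{1}{\hat{\sigma}^2}A\hat{\Lambda}, \tag{7.10}$$

$$p\hat{\sigma}^2 = \operatorname{tr}(A - \hat{\Lambda}\hat{\Lambda}'), \tag{7.11}$$

$$\hat{\Gamma} = \frac{1}{\hat{\sigma}^2}\hat{\Lambda}'\hat{\Lambda}, \tag{7.12}$$

$$\operatorname{nondiag}\hat{\Gamma} = \operatorname{nondiag} 0. \tag{7.13}$$

Comparison of these equations with (7.4) to (7.7) shows the effect of assuming $\Sigma = \sigma^2 I$.
We can write the above equations by letting $H = \hat{\sigma}^2(\hat{\Gamma} + I)$ as

$$\hat{\Lambda}H = A\hat{\Lambda}, \tag{7.14}$$

$$p\hat{\sigma}^2 = \operatorname{tr}(A - \hat{\Lambda}\hat{\Lambda}'), \tag{7.15}$$

$$H = \hat{\Lambda}'\hat{\Lambda} + \hat{\sigma}^2 I, \tag{7.16}$$

$$\operatorname{nondiag} H = \operatorname{nondiag} 0. \tag{7.17}$$

Since $\operatorname{tr}(A - \hat{\Lambda}\hat{\Lambda}') = \operatorname{tr} A - \operatorname{tr}\hat{\Lambda}\hat{\Lambda}' = \operatorname{tr} A - \operatorname{tr}\hat{\Lambda}'\hat{\Lambda} = \operatorname{tr} A - \operatorname{tr}(H - \hat{\sigma}^2 I) = \operatorname{tr} A
- \operatorname{tr} H + m\hat{\sigma}^2$, we have

$$\hat{\sigma}^2 = \frac{1}{p - m}(\operatorname{tr} A - \operatorname{tr} H). \tag{7.18}$$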

Now let us see the relation of the above equations to those defining the characteristic
roots and vectors of $A$. Let the solutions to $|A - dI| = 0$ be $d_1 > d_2 > \cdots > d_p$,
and let $l_1, \cdots, l_p$ be the corresponding characteristic vectors [that is, solutions to
$(A - d_j I)l_j = 0$] normalized by $l_j'l_j = 1$. Let $D$ be the $m \times m$ diagonal matrix with
$d_1, \cdots, d_m$ as diagonal elements and let $L = (l_1, \cdots, l_m)$. Then

$$AL = LD, \tag{7.19}$$

$$L'L = I. \tag{7.20}$$

These equations define $D$ and $L$ uniquely (with the condition that the elements of $D$ are
the largest possible). Thus $D = H$ and $L\Delta = \hat{\Lambda}$, where $\Delta$ is diagonal. Then
$(p - m)\hat{\sigma}^2 = \operatorname{tr} A - \operatorname{tr} H = \sum_{i=1}^{p} d_i - \sum_{i=1}^{m} d_i = \sum_{i=m+1}^{p} d_i$. Also

$$H - \hat{\sigma}^2 I = \hat{\Lambda}'\hat{\Lambda} = \Delta'L'L\Delta = \Delta^2. \tag{7.21}$$

Thus the $\alpha$th diagonal element of $\Delta$, say $\delta_\alpha$, is $\sqrt{d_\alpha - \hat{\sigma}^2}$, and $\hat{\lambda}^{(\alpha)} = \sqrt{d_\alpha - \hat{\sigma}^2}\; l_\alpha$. The
characteristic vectors $l_\alpha$ are known as the principal components of $A$. We see here that
these are proportional to the maximum likelihood estimates of $\Lambda$ in our model when
$\Sigma = \sigma^2 I$. Hotelling [11] suggested this method when $\Sigma = 0$, or rather when $\Sigma$ is very
small; his point of view was that $X$ had an arbitrary normal distribution and $\Lambda f$ should
account for most of the variability of $X$. For our model we should consider his estimate
of $\Lambda$ as $LD^{1/2}$.
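
The closed form just obtained is easy to evaluate once the $m$ largest characteristic roots and vectors of $A$ are available. The sketch below assumes they have already been computed by some eigenvalue routine; the function name and the numerical example are hypothetical. The example uses a covariance matrix built as $\Lambda\Lambda' + \sigma^2 I$ with orthogonal columns of $\Lambda$, so that its roots and vectors are known exactly.

```python
import math

def ml_equal_variance(trace_A, p, roots, vectors):
    """Estimates under Sigma = sigma^2 I.

    roots:   the m largest characteristic roots d_1 > ... > d_m of A
    vectors: the corresponding characteristic vectors, each of unit length
    """
    m = len(roots)
    sigma2 = (trace_A - sum(roots)) / (p - m)             # (7.18) with H = D
    loadings = [[math.sqrt(d - sigma2) * x for x in l]    # lambda^(a) = sqrt(d_a - sigma^2) l_a
                for d, l in zip(roots, vectors)]
    return sigma2, loadings

# Hypothetical example: A = Lam Lam' + 0.5 I with orthogonal columns, p = 4, m = 2.
lam1, lam2 = [2.0, 2.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]
trace_A = 10.0 + 4.0 + 4 * 0.5                 # tr(Lam Lam') + p * sigma^2
roots = [10.0 + 0.5, 4.0 + 0.5]                # d_a = lambda^(a)'lambda^(a) + sigma^2
vectors = [[x / math.sqrt(10.0) for x in lam1], [x / 2.0 for x in lam2]]

sigma2_hat, loadings_hat = ml_equal_variance(trace_A, 4, roots, vectors)
print(sigma2_hat, loadings_hat)
```

Here the common variance is the average of the $p - m$ smallest roots, obtained from the trace without computing those roots individually.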

7.4. Thomson's modification of the principal component method for random factor scores
when $\Lambda\Lambda'$ is unrestricted. For convenience here we require $\Lambda'\Lambda$ to be diagonal. The
equations are

$$\hat{\Lambda}J = (A - \hat{\Sigma})\hat{\Lambda}, \tag{7.22}$$

$$\operatorname{diag}\hat{\Sigma} = \operatorname{diag}(A - \hat{\Lambda}\hat{\Lambda}'), \tag{7.23}$$

$$J = \hat{\Lambda}'\hat{\Lambda}, \tag{7.24}$$

$$\operatorname{nondiag} J = \operatorname{nondiag} 0. \tag{7.25}$$

Given $\hat{\Sigma}$, the characteristic vectors of $A - \hat{\Sigma}$ corresponding to the largest characteristic
roots constitute the columns of $\hat{\Lambda}$ (normalized according to the diagonal elements of
(7.24), that is, the corresponding characteristic roots). Thus the Thomson method [21]
is essentially the method of principal components applied to $A - \hat{\Sigma}$.

This method can be compared to the maximum likelihood method by seeing that the
maximum likelihood method involves the characteristic vectors and roots of $A - \hat{\Sigma}$ in
the metric of $\hat{\Sigma}$.

7.5. The centroid method. This method is based on the algebra used to find $\Lambda$ from $C$
when $\Lambda$ is restricted by $B'\Lambda$ being triangular (see section 6). Let $\Sigma_0$ be an initial
approximation to $\hat{\Sigma}$ and let $C_0 = A - \Sigma_0$. In applying the algebra described in section 6 we
choose the columns of $B$, say $B_0$, in a way that is apparently suitable for this $C_0$. The
first row of $B_0'$ is $b_0^{(1)\prime} = (1, 1, \cdots, 1)$; then an element of $C_0 b_0^{(1)}$ is the sum of the elements
of that row of $C_0$ and $b_0^{(1)\prime}C_0 b_0^{(1)}$ is the sum of all elements of $C_0$. We form $C_0^{(1)} = C_0 -
\lambda_0^{(1)}\lambda_0^{(1)\prime}$, and now apply $b_0^{(2)}$. The elements of this vector are $1$ or $-1$. They are chosen
so as to make $b_0^{(2)\prime}C_0^{(1)}b_0^{(2)}$ as large as possible. The computation of $C_0^{(1)}b_0^{(2)}$ is easy because
only addition and subtraction of elements of $C_0^{(1)}$ are involved. In turn $C_0^{(\alpha)} = C_0^{(\alpha-1)} -
\lambda_0^{(\alpha)}\lambda_0^{(\alpha)\prime}$ is computed, and then $\lambda_0^{(\alpha+1)}$ ($\alpha = 2, \cdots, m - 1$). Then $\Lambda_0 = (\lambda_0^{(1)} \cdots \lambda_0^{(m)})$ is
a first approximation to the estimate of $\Lambda$. Next $A - \Lambda_0\Lambda_0'$ is computed, and the
diagonal elements of this matrix (if nonnegative) are taken for $\Sigma_1$. Then, the same procedure
is followed to obtain $\Lambda_1$, another approximation to the estimate of $\Lambda$. The matrix taken
for $B$, say $B_1$, need not be the same as $B_0$ (except for the first column). In turn $\Sigma_i$ and $\Lambda_i$
are computed until $A - \Lambda_i\Lambda_i'$ is a close enough approximation to $\Sigma_i$.
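
One pass of the centroid computation, for a given $C_0$, might be sketched as follows. This is an illustration under assumptions: the function names and the example loadings are hypothetical, and the sign vectors after the first are found by exhaustive search over all $\pm 1$ vectors, which is practical only for small $p$ and is merely one way of making the quadratic form large.

```python
import math
from itertools import product

def centroid_factor(C, b):
    """One factor from C and a weight vector b: lambda = C b / sqrt(b'C b)."""
    Cb = [sum(c * x for c, x in zip(row, b)) for row in C]
    f = math.sqrt(sum(x * y for x, y in zip(b, Cb)))
    return [x / f for x in Cb]

def best_signs(C):
    """Vector of +1/-1 maximizing b'C b, by brute force (b and -b are equivalent)."""
    p = len(C)
    best, best_val = None, float("-inf")
    for tail in product((1.0, -1.0), repeat=p - 1):
        b = (1.0,) + tail
        val = sum(b[i] * C[i][j] * b[j] for i in range(p) for j in range(p))
        if val > best_val:
            best, best_val = b, val
    return list(best)

def centroid_loadings(C0, m):
    C = [row[:] for row in C0]
    p = len(C)
    cols = []
    for a in range(m):
        b = [1.0] * p if a == 0 else best_signs(C)    # first weight vector is (1, ..., 1)
        lam = centroid_factor(C, b)
        cols.append(lam)
        for i in range(p):                            # residual matrix C^(a)
            for j in range(p):
                C[i][j] -= lam[i] * lam[j]
    return cols

# Hypothetical C_0 of rank 2 built from a 4 x 2 loading matrix.
Lam = [[0.8, 0.3], [0.7, 0.4], [0.6, -0.5], [0.5, -0.6]]
C0 = [[sum(x * y for x, y in zip(r, s)) for s in Lam] for r in Lam]
cols = centroid_loadings(C0, 2)
print(cols)
```

When $C_0$ has rank exactly $m$, the $m$ extracted columns reproduce $C_0$, although they are in general a rotation of the loadings used to build it.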

In a sense the centroid method is an approximation to Thomson's modification of the
principal components method. In that method the first column of $\hat{\Lambda}$ is the characteristic
vector of $A - \hat{\Sigma}$ corresponding to the largest characteristic root. This vector is
proportional to the normalized vector $y$ (that is, $y'y = 1$) that maximizes $y'(A - \hat{\Sigma})y$, and $y$
satisfies $(A - \hat{\Sigma})y = J_1 y$, where $J_1$ is the largest characteristic root of $A - \hat{\Sigma}$. If the
elements of $y$ are about equal, then $y$ is approximately proportional to $b^{(1)}$, the first
column of $B$, and hence $J_1 y$ is approximately proportional to the first vector of $\Lambda$ found by
the centroid method. Similarly if the second characteristic vector of $A - \hat{\Sigma}$ is approximately
proportional to $b^{(2)}$, then it is also approximately proportional to the second column of $\Lambda$
found by the centroid method. We can say that the centroid method approximates the
principal components method by trying to use vectors with elements $\pm 1$ as the
characteristic vectors of $A - \hat{\Sigma}$.

The  big  advantage  of  the  centroid  method  is  the  ease  of  computation.  Accordingly,  it 
is  the  most  used  method. 

7.6. Maximum likelihood estimates for random factor scores when $\Lambda$ is identified by
specified zero elements. In this case we have $\mathcal{E} ff' = M$, where $M$ is not required to be
diagonal. However, we require the diagonal elements to be unity. Certain coefficients of
$\Lambda$ are required to be zero, say

$$\lambda_{i\alpha} = 0, \qquad i = i(1, \alpha), \cdots, i(p_\alpha, \alpha), \quad \alpha = 1, \cdots, m. \tag{7.26}$$

In the $\alpha$th column of $\Lambda$, there are $p_\alpha$ zero elements and these are in rows numbered
$i(1, \alpha), \cdots, i(p_\alpha, \alpha)$. We assume that these conditions effect identification. We can now
apply the method of maximum likelihood. We write down the resulting equations,
inserting another unknown $p \times m$ matrix $J$ (essentially Lagrange multipliers) which has
zero elements where $\Lambda$ does not; that is,

$$j_{i\alpha} = 0, \qquad i \neq i(1, \alpha), \cdots, i(p_\alpha, \alpha), \quad \alpha = 1, \cdots, m. \tag{7.27}$$

The equations are

$$\operatorname{diag}\hat{\Sigma} = \operatorname{diag}(A - \hat{\Lambda}\hat{M}\hat{\Lambda}'), \tag{7.28}$$

$$J'\hat{\Lambda} = 0, \tag{7.29}$$

$$\hat{\Lambda}'\hat{\Sigma}^{-1}A - \hat{\Lambda}' - \hat{\Lambda}'\hat{\Sigma}^{-1}\hat{\Lambda}\hat{M}\hat{\Lambda}' = (\hat{M}^{-1} + \hat{\Lambda}'\hat{\Sigma}^{-1}\hat{\Lambda})J'\hat{\Sigma}. \tag{7.30}$$

The derivation of these equations is given in section 10. We also consider in more
detail a special case when $m = 2$. The above equations cannot be solved algebraically, but
iteration methods can be devised.

7.7. Estimates for nonrandom factor scores when $\Lambda\Lambda'$ is unrestricted. We now consider
$x_\alpha$ ($\alpha = 1, \cdots, N$) to be an observation on

$$X_\alpha = \Lambda f_\alpha + U + \mu, \tag{7.31}$$

where $f_\alpha$ is a fixed vector. Then the expected value of $X_\alpha$ is

$$\mathcal{E}X_\alpha = \Lambda f_\alpha + \mu, \tag{7.32}$$

and the covariance matrix is

$$\mathcal{E}(X_\alpha - \mathcal{E}X_\alpha)(X_\alpha - \mathcal{E}X_\alpha)' = \Sigma. \tag{7.33}$$

This model is similar to the usual model for least squares (or linear regression) except
that here the “independent variates,” the $f_\alpha$, are unknown; the $f_\alpha$ are also parameters.

In one terminology $\Lambda$, $\mu$, and $\Sigma$ are considered “structural parameters” because they
affect all the random variables, and the $f_\alpha$ are considered “incidental parameters”
because each $f_\alpha$ affects only one $X_\alpha$. The problem of estimating $\Lambda$ is essentially equivalent to
estimating linear equations on the “systematic parts” of $X_\alpha$. Let $\mathcal{E}X_\alpha = \xi_\alpha$. The
hypothesis that $\xi_\alpha$ is of the form $\xi_\alpha = \Lambda f_\alpha + \mu$ is equivalent to the hypothesis that $P\xi_\alpha = \gamma$,
where $P$ is a $(p - m) \times p$ matrix such that $P\Lambda = 0$ and $P\mu = \gamma$ (that is, that $\xi_\alpha$
satisfies $p - m$ linear equations).

If we assume $U$ has a normal distribution, the likelihood function is

$$\begin{aligned}
&\frac{1}{(2\pi)^{pN/2}|\Sigma|^{N/2}}\exp\left[-\frac{1}{2}\sum_{\alpha=1}^{N}(x_\alpha - \Lambda f_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \Lambda f_\alpha - \mu)\right] \\
&\quad = \frac{1}{(2\pi)^{pN/2}}\prod_{i=1}^{p}\frac{1}{\sigma_{ii}^{N/2}}\exp\left[-\frac{1}{2\sigma_{ii}}\sum_{\alpha=1}^{N}\Big(x_{i\alpha} - \sum_{\nu=1}^{m}\lambda_{i\nu}f_{\nu\alpha} - \mu_i\Big)^2\right].
\end{aligned} \tag{7.34}$$

The likelihood function does not have a maximum. To show this, let $\mu_1 = 0$, $\lambda_{11} = 1$,
$\lambda_{1\nu} = 0$, $\nu \neq 1$, $f_{1\alpha} = x_{1\alpha}$. Then the first term in the product in (7.34) is $\sigma_{11}^{-N/2}$. As
$\sigma_{11} \to 0$, this term is unbounded. Thus, the likelihood function has no maximum, and
maximum likelihood estimates do not exist. It might be observed that in [15] Lawley
obtained some estimation equations by setting equal to zero the derivatives of the
likelihood function; it is not clear, however, whether these equations define even a relative
maximum, and they obviously cannot define an absolute maximum. (Lawley applied an
iterative method for these equations to some data and found that a $\sigma_{ii}$ tended towards
zero.)
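
The unboundedness can be seen numerically. The short sketch below evaluates the logarithm of the $i = 1$ factor of the product in (7.34), leaving out the constant involving $2\pi$; the scores are hypothetical and the function name is ad hoc. With the first factor score set equal to the first test score, the residual sum of squares is zero, and the term grows without limit as the error variance shrinks.

```python
import math

def first_term_log(x1, sigma11, f1, lam11=1.0, mu1=0.0):
    """Log of the i = 1 factor in the product form of (7.34), without the 2*pi constant."""
    N = len(x1)
    resid = sum((x - lam11 * f - mu1) ** 2 for x, f in zip(x1, f1))
    return -0.5 * N * math.log(sigma11) - 0.5 * resid / sigma11

x1 = [0.3, -1.2, 0.8, 2.1, -0.5]      # hypothetical scores on the first test
# Choose f_{1 alpha} = x_{1 alpha}, so that every residual is exactly zero.
values = [first_term_log(x1, s, f1=x1) for s in (1.0, 1e-2, 1e-4, 1e-8)]
print(values)
```

Each hundredfold reduction in the variance adds the same positive amount, $(N/2)\log 100$, to the logarithm of the likelihood.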

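While we cannot apply the method of maximum likelihood to the distribution of
$x_1, \cdots, x_N$ to find estimates of all parameters, we can apply the method to the
distribution of $A = \frac{1}{N}\sum_{\alpha}(x_\alpha - \bar{x})(x_\alpha - \bar{x})'$ to find estimates of $\Lambda$ and $\Sigma$. The distribution of $A$
is the noncentral Wishart distribution [3] and depends on $\Sigma$ and

$$\Lambda\left(\frac{1}{N}\sum_{\alpha=1}^{N} f_\alpha f_\alpha'\right)\Lambda' \tag{7.35}$$

when $\sum_\alpha f_\alpha = 0$. If $M = \frac{1}{N}\sum_\alpha f_\alpha f_\alpha'$, then the matrix is $\Lambda M \Lambda'$; if we require $M = I$,
then the matrix is $\Lambda\Lambda'$. With some restrictions on $\Lambda$ to take out the rotation, $\Sigma$ and $\Lambda$ are
identified.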

The application of the method of maximum likelihood to the distribution of $A$ is
detailed in section 11. Of the resulting equations, one set of $m$ is extremely complicated and
cannot be solved explicitly. The other equations are similar to the equations obtained
by applying the method of maximum likelihood to the case of random factors.

The question arises whether the maximum likelihood estimates for the case of random
factors are suitable for the case of nonrandom factors. In section 11 we prove that the
estimates based on maximizing the noncentral Wishart likelihood function are
asymptotically equivalent to the maximum likelihood estimates for random factors in the sense
that $\sqrt{N}$ times the difference of the two respective estimates converges stochastically
to zero. It would, therefore, appear that for large samples in the case of nonrandom
factors one can use the maximum likelihood estimates for random factors.

Another asymptotic result that is proved in Part II is that under certain suitable
identification conditions the asymptotic distribution of the maximum likelihood
estimate of $\Lambda$ for random factors is the same whatever the assumption on the factors.

7.8. Units of measurement. In the preceding sections we have considered factor
analysis methods applied to covariance matrices. In many cases the unit of measurement of
each component of $x$ is arbitrary. For instance, in psychological tests, the unit of scoring
has no intrinsic meaning. We now consider how changes in the units of measurement
affect the analysis.

Changing the units of measurement means multiplying each component of $x$ by a
constant; we are interested in cases where not all of these constants are equal. It would
be desirable that when a given test score is multiplied by a constant the factor loadings
for the test are multiplied by the same constant and the error variance is multiplied by
the square of the constant. Suppose $Dx = x^*$, where $D$ is a diagonal matrix and not all the
diagonal elements are the same. Then $\mathcal{E}x^* = D\mu = \mu^*$, say, and

(7.36)  $\mathcal{E}(x^* - \mu^*)(x^* - \mu^*)' = D\Psi D = D\Lambda(D\Lambda)' + D\Sigma D = \Psi^*$,

say. Now represent this as

(7.37)  $\Psi^* = \Lambda^*\Lambda^{*\prime} + \Sigma^*$.

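Clearly $\Sigma^*$ can be taken as $D\Sigma D$ and $\Lambda^*\Lambda^{*\prime}$ can be taken as $D\Lambda(D\Lambda)'$ (and must be
taken this way if $\Sigma$ and $\Lambda\Lambda'$ are identified), but whether $\Lambda^*$ can be taken as $D\Lambda$ depends
on what kind of restrictions are imposed on $\Lambda$ and $\Lambda^*$ to make each unique. If $\Lambda$ (and $\Lambda^*$)
is required to have an upper triangular matrix of 0's, then so does $D\Lambda$ and $D\Lambda = \Lambda^*$. If
$B'\Lambda$ (and $B'\Lambda^*$) is required to have an upper triangle of 0's, then usually $B'D\Lambda$ will not,
and $D\Lambda \neq \Lambda^*$. If $\Lambda'\Lambda$ (and $\Lambda^{*\prime}\Lambda^*$) is required to be diagonal then usually
$(D\Lambda)'D\Lambda = \Lambda'D^2\Lambda$ will not, and $D\Lambda \neq \Lambda^*$. If $\Lambda'\Sigma^{-1}\Lambda$ (and $\Lambda^{*\prime}\Sigma^{*-1}\Lambda^*$) is required to be
diagonal, then $(D\Lambda)'(D\Sigma D)^{-1}D\Lambda = \Lambda'\Sigma^{-1}\Lambda$ and $D\Lambda = \Lambda^*$.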
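The contrast between the last two identification conditions can be checked numerically. The following is a minimal sketch, not part of the original development: the loading matrix, error variances, and rescaling matrix are hypothetical values chosen only for illustration, and the helper names `offdiag_norm` and `rotate_to_diagonal` are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 6, 2

# Hypothetical parameter values, chosen only for illustration.
Lam = rng.normal(size=(p, m))                    # factor loadings
Sigma = np.diag(rng.uniform(0.5, 2.0, size=p))   # diagonal error covariance
D = np.diag(np.arange(1.0, p + 1.0))             # unequal changes of units

def offdiag_norm(S):
    """Frobenius norm of the off-diagonal part of a square matrix."""
    return float(np.linalg.norm(S - np.diag(np.diag(S))))

def rotate_to_diagonal(L, W):
    """Rotate L by an orthogonal matrix so that L' W L becomes diagonal."""
    _, V = np.linalg.eigh(L.T @ W @ L)
    return L @ V

Sinv = np.linalg.inv(Sigma)
Sinv_star = np.linalg.inv(D @ Sigma @ D)

# Condition "Lam' Sigma^{-1} Lam diagonal": preserved by Lam -> D Lam, Sigma -> D Sigma D.
Lam_c = rotate_to_diagonal(Lam, Sinv)
gamma = Lam_c.T @ Sinv @ Lam_c
gamma_star = (D @ Lam_c).T @ Sinv_star @ (D @ Lam_c)

# Condition "Lam' Lam diagonal": generally lost, since (D Lam)'(D Lam) = Lam' D^2 Lam.
Lam_b = rotate_to_diagonal(Lam, np.eye(p))
gram_star = (D @ Lam_b).T @ (D @ Lam_b)

print("off-diagonal size, weighted condition:  ", offdiag_norm(gamma_star))
print("off-diagonal size, unweighted condition:", offdiag_norm(gram_star))
```

In this sketch the first printed quantity is zero up to rounding, while the second is not, in line with the statement that only the weighted condition is carried over by the rescaled loadings.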

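Now let us see how the estimation methods depend on the units of measurement.
Let $Dx_\alpha = x_\alpha^*$. Then $NA^* = \sum_\alpha (x_\alpha^* - \bar{x}^*)(x_\alpha^* - \bar{x}^*)' = NDAD$. The equations for
the maximum likelihood estimates of section 7.2 are then

(7.38)  $\hat\Lambda^*(I + \hat\Gamma^*) = DAD\hat\Sigma^{*-1}\hat\Lambda^*$,

(7.39)  $\operatorname{diag} \hat\Sigma^* = \operatorname{diag}(DAD - \hat\Lambda^*\hat\Lambda^{*\prime})$,

(7.40)  $\hat\Gamma^* = \hat\Lambda^{*\prime}\hat\Sigma^{*-1}\hat\Lambda^*$,

(7.41)  $\operatorname{nondiag} \hat\Gamma^* = \operatorname{nondiag} 0$.

Clearly $\hat\Lambda^* = D\hat\Lambda$ and $\hat\Sigma^* = D\hat\Sigma D$ is the solution [when $\hat\Lambda$ and $\hat\Sigma$ is a solution to (7.4)
to (7.7)]. Then the results of this method do not essentially depend on the units of
measurement.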
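A small numerical check of this statement can be sketched as follows. The sketch assumes an exact-fit matrix $A = \Sigma + \Lambda\Lambda'$ (so that a solution of the likelihood equations is known without iteration); a real sample matrix would require an iterative solution, which is not attempted here. The function name `ml_equation_residual` and all numerical values are hypothetical.

```python
import numpy as np

def ml_equation_residual(A, Lam, Sigma):
    """Largest absolute residual of the likelihood equations at (Lam, Sigma)."""
    Sinv = np.linalg.inv(Sigma)
    Gam = Lam.T @ Sinv @ Lam
    m = Lam.shape[1]
    r_load = Lam @ (np.eye(m) + Gam) - A @ Sinv @ Lam     # loading equation
    r_diag = np.diag(Sigma) - np.diag(A - Lam @ Lam.T)    # diagonal equation
    r_off = Gam - np.diag(np.diag(Gam))                   # Gamma must be diagonal
    return float(max(np.abs(r_load).max(), np.abs(r_diag).max(), np.abs(r_off).max()))

rng = np.random.default_rng(1)
p, m = 5, 2
Sigma = np.diag(rng.uniform(0.5, 2.0, size=p))
Lam0 = rng.normal(size=(p, m))
_, V = np.linalg.eigh(Lam0.T @ np.linalg.inv(Sigma) @ Lam0)
Lam = Lam0 @ V                    # now Lam' Sigma^{-1} Lam is diagonal
A = Sigma + Lam @ Lam.T           # exact-fit "sample" matrix (an assumption of this sketch)

D = np.diag([1.0, 2.0, 0.5, 3.0, 10.0])   # arbitrary change of units
res_original = ml_equation_residual(A, Lam, Sigma)
res_rescaled = ml_equation_residual(D @ A @ D, D @ Lam, D @ Sigma @ D)
print(res_original, res_rescaled)
```

Both residuals vanish up to rounding, illustrating that a solution in the original units is carried into a solution in the new units by the rescaling.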

The second estimation procedure considered assumes $\Sigma = \sigma^2 I$. In the new units
$\Sigma^* = D\Sigma D = \sigma^2 D^2$, which is not proportional to $I$ and therefore, if this method is
applicable to $\Psi$, it is not applicable to $\Psi^*$.

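In the third method the transformed equations are

(7.42)  $\hat\Lambda^* J^* = (DAD - \hat\Sigma^*)\hat\Lambda^*$,

(7.43)  $\operatorname{diag} \hat\Sigma^* = \operatorname{diag}(DAD - \hat\Lambda^*\hat\Lambda^{*\prime})$,

(7.44)  $J^* = \hat\Lambda^{*\prime}\hat\Lambda^*$,

(7.45)  $\operatorname{nondiag} J^* = \operatorname{nondiag} 0$.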

We know that because of (7.44) and (7.45) $\hat\Lambda^* \neq D\hat\Lambda$, but we can ask the question
whether $\hat\Lambda^* = D\hat\Lambda P$, where $P$ is orthogonal; that is, whether $\hat\Lambda^*\hat\Lambda^{*\prime} = D\hat\Lambda\hat\Lambda'D$ (whether
$\hat\Lambda^*$ defines the same factor space as $D\hat\Lambda$). If $\hat\Lambda^*\hat\Lambda^{*\prime} = D\hat\Lambda\hat\Lambda'D$, then $\hat\Sigma^* = D\hat\Sigma D$ and
(7.42) can be written

(7.46)  $(D^{-1}\hat\Lambda^*)J^* = (A - \hat\Sigma)D^2(D^{-1}\hat\Lambda^*)$.

This indicates that the diagonal elements of $J^*$ are the $m$ largest roots of

(7.47)  $|(A - \hat\Sigma) - jD^{-2}| = 0$,

and the columns of $D^{-1}\hat\Lambda^*$ are the corresponding vectors satisfying

(7.48)  $[(A - \hat\Sigma)D^2 - j_\alpha I]x^{(\alpha)} = 0$.

However, the roots of (7.47) will, in general, not be the roots of $A - \hat\Sigma$ and the vectors
satisfying (7.48) will not span the same linear subspace as the first $m$ characteristic
vectors of $A - \hat\Sigma$. Thus changing the units of measurement will change the estimated
factor space in the Thomson method.
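The dependence of the leading characteristic subspace on the units can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the matrix `C_full` is a hypothetical full-rank symmetric matrix standing in for the residual matrix (sample covariance minus estimated error variances), and the helper names are our own. When the matrix has exactly rank $m$ the subspace transforms properly; when it has full rank, as in a sample, it generally does not.

```python
import numpy as np

def leading_subspace(C, m):
    """Orthonormal basis for the span of the m leading characteristic vectors of symmetric C."""
    _, V = np.linalg.eigh(C)              # eigenvalues in ascending order
    return V[:, ::-1][:, :m]

def subspace_gap(U, W):
    """Sine of the largest principal angle between two m-dimensional column spaces."""
    Qu, _ = np.linalg.qr(U)
    Qw, _ = np.linalg.qr(W)
    return float(np.linalg.norm(Qw - Qu @ (Qu.T @ Qw), 2))

rng = np.random.default_rng(2)
p, m = 5, 2
D = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical change of units
Lam = rng.normal(size=(p, m))

# Case 1: C has exactly rank m, so its leading subspace is the column space of Lam.
C_exact = Lam @ Lam.T
gap_exact = subspace_gap(D @ leading_subspace(C_exact, m),
                         leading_subspace(D @ C_exact @ D, m))

# Case 2: C has full rank, as the residual matrix typically does in a sample.
E = rng.normal(size=(p, p))
C_full = C_exact + 0.5 * (E + E.T)
gap_full = subspace_gap(D @ leading_subspace(C_full, m),
                        leading_subspace(D @ C_full @ D, m))

print(gap_exact, gap_full)
```

A gap of zero means the two factor spaces coincide; in the full-rank case the gap is clearly positive.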

Now let us consider the centroid method. Since we know that if $B'\Lambda$ has an upper
triangle of 0's, then $B'D\Lambda$ in general will not, we ask whether the centroid method applied
to $A^* = DAD$ will give $\hat\Lambda^*\hat\Lambda^{*\prime} = D\hat\Lambda\hat\Lambda'D$; that is, whether $\hat\Lambda^* = D\hat\Lambda P$, where $P$
is some orthogonal matrix. In the original metric, we have $CB = \hat\Lambda F'$, where
$C = A - \hat\Sigma$ and $\operatorname{diag} \hat\Sigma = \operatorname{diag}(A - \hat\Lambda\hat\Lambda')$. If $\hat\Lambda^* = D\hat\Lambda P$, then
$\operatorname{diag} \hat\Sigma^* = \operatorname{diag}(DAD - D\hat\Lambda\hat\Lambda'D) = \operatorname{diag} D\hat\Sigma D$ and $C^* = DCD$. Then
$\hat\Lambda^* F^{*\prime} = C^*B = DCDB$. Let $\hat\Lambda^* = D\tilde\Lambda$. Then $\tilde\Lambda F^{*\prime} = CDB$ and we ask whether
$\tilde\Lambda = \hat\Lambda P$. This can be true in general only if $DB(F^{*\prime})^{-1} = B(F')^{-1}P$; that is, only if
$DBQ = B$ for some nonsingular $Q$. In general, this is not true (only if the $m$ columns of $B$
lie in an $m$-dimensional space spanned by some $m$ characteristic vectors of $D$). However, in
the centroid method the choice of $B$ is left to the investigator, subject to the conditions
that the first column is composed of 1's and the other columns have 1's and $-1$'s as
elements. Thus, the $B^*$ used for $A^*$ would usually not be the $B$ used for $A$. Then we would
need $DB^*Q = B$. While it is hard to describe exactly how $B$ is chosen by the investigator,
we can say roughly that the columns of $B$ are selected as characteristic vectors of $C$, and
thus the columns of $\hat\Lambda$ are approximately proportional to the first $m$ characteristic
vectors of $C$. But in the latter case we have shown that the transformation of $A$ to $DAD$
does not transform $\hat\Lambda\hat\Lambda'$ to $D\hat\Lambda\hat\Lambda'D$; hence, we can conclude that to the extent that the
centroid method approximates the principal components method (applied to $C$), it does not
transform properly with changes of scale of measurement.

In the case where $\Lambda$ is identified by 0 coefficients in specified positions, $D\Lambda$ satisfies
the same conditions and hence $\Lambda^* = D\Lambda$. In estimation $J^*$ has 0's specified in the same
positions as in $J$. It is a straightforward matter to show that $\hat\Lambda^* = D\hat\Lambda$, $\hat\Sigma^* = D\hat\Sigma D$,
$\hat{M}^* = \hat{M}$, and $J^* = D^{-1}J$ satisfy (7.28), (7.29) and (7.30) when $A^* = DAD$.

In the case of nonrandom factor scores we suggest applying the method of maximum
likelihood to the likelihood of $A$. It can be seen from results in Part II that $D\Lambda$ and
$D\Sigma D$ satisfy the condition for removing the rotation from $\Lambda^*$ (and $\Lambda$). The value of the
likelihood function at $A$, $\Sigma$, $\Lambda$ is the same as at $DAD$, $D\Sigma D$, $D\Lambda$; hence, the maximum
for $A^* = DAD$ is at $\hat\Sigma^* = D\hat\Sigma D$, $\hat\Lambda^* = D\hat\Lambda$ when the maximum for $A$ is at $\hat\Sigma$, $\hat\Lambda$.

As has been noted above, the estimation of $\Lambda$ by the centroid or Thomson's principal
components method depends essentially on the units of measurement of the test scores,
even though these units may have no intrinsic meaning. A practical remedy to this
undesirable indeterminacy is to prescribe a "statistical" unit of measurement. It is
customary to let the sample determine the unit of measurement by requiring that each test
score have sample variance 1. Thus $d_{ii}$ is taken to be $1/\sqrt{a_{ii}}$. The new matrix is
$R = (r_{ij})$, where $r_{ij} = a_{ij}/\sqrt{a_{ii}a_{jj}}$ are the sample correlation coefficients. Besides taking out
the indeterminacy, this convention has some other advantages. From the practical point
of view it is convenient to have the diagonal elements unity and the other numbers
between $-1$ and $+1$; this makes it easier to find rules of thumb and convenient
computational procedures.
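The passage from the sample covariance matrix to the sample correlation matrix, and the fact that the result no longer depends on the original units, can be sketched as follows. The data and the rescaling matrix are hypothetical; the function name `correlation_from_covariance` is our own.

```python
import numpy as np

def correlation_from_covariance(A):
    """Rescale a covariance matrix so that every variable has unit variance."""
    d = 1.0 / np.sqrt(np.diag(A))        # d_ii = 1 / sqrt(a_ii)
    return A * np.outer(d, d)            # r_ij = a_ij / sqrt(a_ii a_jj)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # hypothetical test scores
Xc = X - X.mean(axis=0)
A = Xc.T @ Xc / X.shape[0]               # sample covariance matrix

D = np.diag([1.0, 10.0, 0.1, 2.5])       # arbitrary change of units
R = correlation_from_covariance(A)
R_rescaled = correlation_from_covariance(D @ A @ D)
print(np.abs(R - R_rescaled).max())
```

The printed difference is zero up to rounding: any method applied to the correlation matrix gives the same result whatever units the scores were recorded in.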

The centroid method is an approximation to the modified principal components
method. If we compare the equations for the latter with those for the maximum likelihood
solution, we see that when $\Sigma$ is roughly proportional to $I$, then the principal component
estimates are close to the maximum likelihood estimates, which have certain desirable
properties (for example, asymptotic efficiency). If the transformation to test
scores with unit sample variance tends to make the error variances ($\sigma_{ii}$) approximately
equal, then the efficiency of these procedures is presumably improved.1

It might be pointed out that the assumption of section 7.3 that $\Sigma = \sigma^2 I$ is a little less
restrictive than it seems. Suppose one knows the error variances except for a constant of
proportionality $\sigma^2$. Then $\Sigma = \sigma^2 D^{-2}$, say, where $D^{-2}$ is known. Then we can let $Dx_\alpha = x_\alpha^*$
and apply the principal components method to $A^* = DAD$. It might also be noted that
Whittle [28] has treated the nonrandom factor case under the assumption that $\Sigma = \sigma^2 I$
and has obtained a solution for $\Lambda$ in terms of the principal components of $A$.

7.9.  Invariance of factor loadings under changes of factor score populations. Now let us
consider the model $X = \Lambda f + \mu + U$, where $\mathcal{E}ff' = M$ is not necessarily required to
be the identity, and where $f$ and $U$ are considered random. Of the various ways of
identifying $\Lambda$ (where $\Sigma$ is identified), consider (a) $B'\Lambda$ has an upper triangle of 0's and $M = I$,
(b) $\Lambda'\Lambda$ is diagonal and $M = I$, (c) $\Lambda'\Sigma^{-1}\Lambda$ is diagonal and $M = I$, and (d) $\Lambda$ has
specified 0's. Only the last does not involve $M$.

A mathematical factor analysis is supposed to be a representation of some real
population of individuals from which we sample randomly. In defining such a representation
it is desirable that at least certain parts of the model do not change even though the
population is changed. For example, consider a model for certain mental test scores of a
certain population, say, boys of age 16 in New York State. Then consider a subpopulation,
say, boys of age 16 in eleventh grade in New York State. Can the same model apply
to this subpopulation? To put it another way, if one investigator factor-analyzes the first
population and another analyzes the second, what results of the analyses might be
common to the two studies (see also [18])?

If the definition of the subpopulation is independent of $f$ and $U$ (that is, does not
depend on the factor scores and "errors" including specific factors), then the subpopulation
is a miniature of the first and any model for the first furnishes the same model for the
second. However, in the example above it would seem reasonable that the subpopulation
involves a selection based on the factor scores related to the set of tests considered (as
well as other factors).

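Let us consider what happens in the above model if $f$ is replaced by $g$, where $\mathcal{E}g = \gamma$
and $\mathcal{E}(g - \gamma)(g - \gamma)' = P$. Then in the subpopulation

(7.49)  $X^* = \Lambda g + \mu + U = \Lambda(g - \gamma) + (\mu + \Lambda\gamma) + U$.

The investigator is going to represent this as

(7.50)  $X^* = \Lambda^* f^* + \mu^* + U^*$,

where $\mathcal{E}U^* = 0$, $\mathcal{E}f^* = 0$, $\mathcal{E}U^*U^{*\prime} = \Sigma^*$, $\mathcal{E}f^*f^{*\prime} = M^*$ and $\Lambda^*$ and $M^*$ satisfy the
identification conditions.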

Let $\mu^* = \mu + \Lambda\gamma$, $U^* = U$, $\Sigma^* = \Sigma$, and $f^* = Q(g - \gamma)$ and $\Lambda^* = \Lambda Q^{-1}$ for some
nonsingular $Q$. It is clear that the columns of $\Lambda^*$ span the same space as the columns
of $\Lambda$. If (d) is used for identification $Q$ must be diagonal, and each column of $\Lambda^*$ must be
proportional to the corresponding column of $\Lambda$; also $q_{ii} =$ . If the normalization
of a column of $\Lambda$ (and $\Lambda^*$) is done by a rule involving only that column (for example, by
means of making a specified element equal to one), then that column of $\Lambda$ is equal to
that column of $\Lambda^*$. It can also be shown that if simple structure effects identification of $\Lambda$
in the original population, it will in the second and will lead to a $\Lambda^*$ with proportional
columns. In the case of identification by methods (a), (b), or (c) $\Lambda^*$ is not related as
simply to $\Lambda$. In each case $M^* = QPQ' = I$. In (a) $Q$ also satisfies $B'\Lambda^* = B'\Lambda Q^{-1} = F^*$
(with upper triangle of 0's); in (b) $\Lambda^{*\prime}\Lambda^* = (Q^{-1})'\Lambda'\Lambda Q^{-1}$ is diagonal; in (c)
$\Lambda^{*\prime}\Sigma^{-1}\Lambda^* = (Q^{-1})'\Lambda'\Sigma^{-1}\Lambda Q^{-1}$ is diagonal. In each of these cases $\Lambda^*$ will in general not be a
rotation of $\Lambda$. Thus, only if identification does not essentially involve $M$ can one hope that
the results of a factor analysis for one population will bear a simple relation to the results
for another population if the two populations differ with respect to the factors involved.

1  Whittle [28] has suggested that if one assumes the variance of the measurement is proportional to
the error variance, then it is reasonable to use the correlation matrix. In the case of nonrandom factor
scores, he has assumed $\sigma_{ii} = c a_{ii}$, but finds he is led to principal components of $R$ only in the case
of $m = 1$.

There seems to have been a considerable discussion by psychologists of the requirement
that $M = I$. Some claim that the orthogonality (that is, lack of correlation) of the
factor scores is essential if one is to consider the factor scores as more basic than the test
scores. However, if the factor scores are orthogonal for some population, in general they
will not be orthogonal for another population or for a subpopulation. Hence, this
requirement would seem to lead to a less basic definition of factors.

We might also consider the effect of the use of correlations in factor analysis on the
comparability of analyses of different populations. In the original population
$\Psi = \Lambda M\Lambda' + \Sigma$ and $R = D\Psi D = (D\Lambda)M(D\Lambda)' + D\Sigma D$, where
$d_{ii}^{-2} = \sum_{\alpha,\beta} \lambda_{i\alpha} m_{\alpha\beta} \lambda_{i\beta} + \sigma_{ii}$;
in the second population $\Psi^* = \Lambda P\Lambda' + \Sigma$ and $R^* = D^*\Psi^*D^* = (D^*\Lambda)P(D^*\Lambda)' + D^*\Sigma D^*$,
where $d_{ii}^{*-2} = \sum_{\alpha,\beta} \lambda_{i\alpha} p_{\alpha\beta} \lambda_{i\beta} + \sigma_{ii}$. Then the relation of the factor loading matrix
of $R^*$ to that of $R$ is further complicated by the premultiplication and postmultiplication
of a diagonal matrix that depends on the subpopulation (that is, on $P$). Thus the use of
correlations instead of covariances makes the comparison of factor loadings in two
populations more difficult.

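A question related to the above is what happens to the analysis if tests are added (or
deleted). Let

(7.51)  $X^* = \Lambda^* f + \mu^* + U^*$,

where $X^*$ is a vector of added test scores. Then the entire set consists of the components
of $X$ and $X^*$. We assume $\mathcal{E}U^* = 0$, $\mathcal{E}UU^{*\prime} = 0$, $\mathcal{E}U^*U^{*\prime} = \Sigma^*$, a diagonal matrix.
What identification conditions leave $\Lambda$ unaffected? The conditions in terms of the entire
set of tests are
$(\Lambda' \;\; \Lambda^{*\prime}) \begin{pmatrix} B \\ B^* \end{pmatrix} = \Lambda'B + \Lambda^{*\prime}B^*$ is a triangular matrix in (a),
$(\Lambda' \;\; \Lambda^{*\prime})(\Lambda' \;\; \Lambda^{*\prime})' = \Lambda'\Lambda + \Lambda^{*\prime}\Lambda^*$ is diagonal in (b),

(7.52)  $(\Lambda' \;\; \Lambda^{*\prime}) \begin{pmatrix} \Sigma & 0 \\ 0 & \Sigma^* \end{pmatrix}^{-1} \begin{pmatrix} \Lambda \\ \Lambda^* \end{pmatrix} = \Lambda'\Sigma^{-1}\Lambda + \Lambda^{*\prime}\Sigma^{*-1}\Lambda^*$

is diagonal in (c). In general, these will not be satisfied. In (d), however, the restrictions
are the same. Thus in the first three cases addition of new tests will lead usually to a
rotation of $\Lambda$.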

7.10.  Asymptotic distributions of estimates. For any of the estimation procedures
described in this paper, it would be desirable to have the distribution of the estimates.
However, in general the exact distribution of any set of estimates is virtually impossible
to obtain. The best we can hope for is an asymptotic distribution theory for a set of
estimates.

In section 12 we prove that the maximum likelihood estimates given in section 7.2 are
asymptotically normally distributed if the matrix $(\phi_{ij}^2)$ is nonsingular, where
$\Phi = \Sigma - \Lambda(\Lambda'\Sigma^{-1}\Lambda)^{-1}\Lambda'$; the condition is implied by the condition that $\Sigma$ and $\Lambda$ are
identified. Some of the asymptotic variances and covariances are given in section 12.
Unfortunately, the variances and covariances of the elements of $\hat\Lambda$ are so complicated that they
cannot be used for all the usual purposes.

The asymptotic variances and covariances of the estimates in section 7.3 have been
given by Lawley [16] and the asymptotic normality has also been proved [4]. However,
the assumptions underlying this theory ($\Sigma = \sigma^2 I$) are so restrictive that it would appear
that the theory is not of much applicability.

In section 12 it is stated that the modified principal component estimates (of section
7.4) are asymptotically normally distributed if $(\theta_{ij}^2)$ is nonsingular, where
$\Theta = I - \Lambda(\Lambda'\Lambda)^{-1}\Lambda'$. The asymptotic variances and covariances can be found in a fashion similar
to those of the maximum likelihood estimates. Again they are very complicated.

The centroid estimates are not defined explicitly in terms of mathematical operations
because the investigator chooses $B$ somewhat subjectively. Hence, we cannot define any
asymptotic distribution theory. It is possible to formalize the procedure by assuming
that $p = k2^m$, where $k$ is an integer, and defining $b^{(1)\prime} = (1, 1, \ldots, 1)$, $b^{(2)}$ consists of
$p/2$ 1's and $p/2$ $-1$'s such that $b^{(2)\prime}C^{(1)}b^{(2)}$ is a maximum, $b^{(3)}$ consists of $p/4$ 1's where
$b^{(2)}$ has 1's, $p/4$ $-1$'s where $b^{(2)}$ has 1's, etc., such that $b^{(3)\prime}C^{(2)}b^{(3)}$ is a maximum, etc.
This procedure, however, is very difficult to study.

In the case of maximum likelihood estimates when $\Lambda$ is identified by zero elements,
the estimates are asymptotically normally distributed. The proof of this theorem
(theorem 12.3) is not given because it is extremely complicated.

When the factor scores are nonrandom (section 7.7), it is stated in section 11 that the
estimates based on the distribution of $A$ are asymptotically equivalent to the estimates
given in section 7.2 in the sense that $\sqrt{N}$ times the difference of two estimates converges
stochastically to zero. Thus, as far as asymptotic normality and asymptotic variances
and covariances are concerned, the two methods are equivalent. It follows from theorem
12.1 that these estimates are asymptotically normally distributed.

8.  Problems of statistical inference: Tests of hypotheses and determination of the number of factors

8.1.  Test of the hypothesis that the model fits (V). In the discussion of estimation we have
assumed that the model is proper for the relevant data; in particular, we have assumed
that $m$, the number of factors, is known. In this section we consider testing the hypothesis
that $\Psi$, the population covariance matrix, can be written as $\Sigma + \Lambda\Lambda'$, where $\Lambda$ has a
specified number of columns. There are other assumptions of the model, such as normality,
linearity of effects of factors, etc., which may be questioned, but we will not
consider them here.

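One method of obtaining a test of the hypothesis $\Psi = \Sigma + \Lambda\Lambda'$ is to derive the
likelihood ratio criterion under the conditions of section 7.2. The likelihood function can be
written

(8.1)  $L(A, \Psi, \mu) = (2\pi)^{-pN/2} |\Psi|^{-N/2} \exp\{-[N \operatorname{tr} A\Psi^{-1} + N(\bar{x} - \mu)'\Psi^{-1}(\bar{x} - \mu)]/2\}$.

The alternatives to the hypothesis are that $\Psi$ is any positive definite matrix. Under the
alternative hypotheses $\hat\mu = \bar{x}$ and $\hat\Psi = A$. Under the null hypothesis $\hat\mu = \bar{x}$ and
$\hat\Psi = \hat\Sigma + \hat\Lambda\hat\Lambda'$, where $\hat\Sigma$ and $\hat\Lambda$ are defined in section 7.2. The likelihood ratio criterion is

(8.2)  $\dfrac{L(A, \hat\Sigma + \hat\Lambda\hat\Lambda', \hat\mu)}{L(A, A, \hat\mu)} = \dfrac{|A|^{N/2}}{|\hat\Sigma + \hat\Lambda\hat\Lambda'|^{N/2}} \exp\{-\tfrac{N}{2}[\operatorname{tr} A(\hat\Sigma + \hat\Lambda\hat\Lambda')^{-1} - p]\}$.

The exponent in (8.2) involves

(8.3)  $\operatorname{tr} A(\hat\Sigma + \hat\Lambda\hat\Lambda')^{-1} - p = \operatorname{tr} (A - \hat\Sigma - \hat\Lambda\hat\Lambda')(\hat\Sigma + \hat\Lambda\hat\Lambda')^{-1}$

       $= \operatorname{tr} (A - \hat\Sigma - \hat\Lambda\hat\Lambda')[\hat\Sigma^{-1} - \hat\Sigma^{-1}\hat\Lambda(I + \hat\Gamma)^{-1}\hat\Lambda'\hat\Sigma^{-1}]$

       $= \operatorname{tr} (A - \hat\Sigma - \hat\Lambda\hat\Lambda')\hat\Sigma^{-1} - \operatorname{tr} (A - \hat\Sigma - \hat\Lambda\hat\Lambda')\hat\Sigma^{-1}\hat\Lambda(I + \hat\Gamma)^{-1}\hat\Lambda'\hat\Sigma^{-1}$

       $= 0$

because (7.5) implies the diagonal elements of $(A - \hat\Sigma - \hat\Lambda\hat\Lambda')\hat\Sigma^{-1}$ are 0 and (7.8)
implies $(A - \hat\Sigma - \hat\Lambda\hat\Lambda')\hat\Sigma^{-1}\hat\Lambda = 0$. Because $|I + PQ| = |I + QP|$,

(8.4)  $|\hat\Sigma + \hat\Lambda\hat\Lambda'| = |\hat\Sigma| \cdot |I + \hat\Lambda\hat\Lambda'\hat\Sigma^{-1}| = |\hat\Sigma| \cdot |I + \hat\Lambda'\hat\Sigma^{-1}\hat\Lambda| = |\hat\Sigma| \cdot |I + \hat\Gamma|$.

It is convenient to consider $(-2)$ times the logarithm of the likelihood ratio criterion
which is

(8.5)  $U_m = N[\log|\hat\Sigma| + \log|I + \hat\Gamma| - \log|A|]$.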

The test procedure is to reject the hypothesis if $U_m$ exceeds a number; this number is
chosen to give the desired significance level. While the exact distribution of $U_m$ is not
known, the usual asymptotic theory tells us that if $(\phi_{ij}^2)$ is nonsingular $U_m$ is
asymptotically distributed as $\chi^2$ with number of degrees of freedom equal to
$C = p(p + 1)/2 + m(m - 1)/2 - p - pm$.
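As a small worked illustration of the degrees-of-freedom count, the following sketch evaluates the formula for a few hypothetical values of the number of tests and the number of factors. The function names are our own; the second function uses the algebraically equivalent form $[(p - m)^2 - (p + m)]/2$, obtained by collecting terms, as a cross-check.

```python
def lr_degrees_of_freedom(p, m):
    """Degrees of freedom p(p+1)/2 + m(m-1)/2 - p - pm of the asymptotic chi-square."""
    return p * (p + 1) // 2 + m * (m - 1) // 2 - p - p * m

def lr_degrees_of_freedom_alt(p, m):
    """The same count written as ((p - m)^2 - (p + m)) / 2."""
    return ((p - m) ** 2 - (p + m)) // 2

# Hypothetical numbers of tests p and factors m.
for p, m in [(3, 1), (6, 2), (9, 3)]:
    print(p, m, lr_degrees_of_freedom(p, m))
```

For three tests and one factor the count is zero, so the hypothesis imposes no testable restriction in that case; a test is available only when the count is positive.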

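The diagonal elements of $\hat\Gamma$ in section 7.2 are the $m$ largest roots of

(8.6)  $|A - \hat\Sigma - \gamma\hat\Sigma| = 0$.

Let $\gamma_{m+1}, \ldots, \gamma_p$ be the other roots of (8.6). Then it can be shown that

(8.7)  $\operatorname{tr} A(\hat\Sigma + \hat\Lambda\hat\Lambda')^{-1} - p = \sum_{i=m+1}^{p} \gamma_i$,

(8.8)  $|A(\hat\Sigma + \hat\Lambda\hat\Lambda')^{-1}| = \prod_{i=m+1}^{p} (1 + \gamma_i)$.

Thus the criterion is

(8.9)  $U_m = -N \sum_{i=m+1}^{p} \log(1 + \gamma_i)$.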
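The computation of the criterion from the roots can be sketched as follows. This is only an illustration under stated assumptions: the diagonal matrix of estimated error variances is taken as given (the maximum likelihood fitting itself is not shown), the function name `lr_criterion_from_roots` is our own, and the numerical values are hypothetical. The roots of the determinantal equation are obtained as the characteristic roots of the sample covariance matrix rescaled by the estimated error standard deviations, minus one.

```python
import numpy as np

def lr_criterion_from_roots(A, Sigma_hat, m, N):
    """Return (U_m, roots): U_m = -N * sum of log(1 + gamma_i) over the p - m smallest roots."""
    s = 1.0 / np.sqrt(np.diag(Sigma_hat))
    # Roots of |A - Sigma_hat - gamma * Sigma_hat| = 0 are the characteristic roots
    # of Sigma_hat^{-1/2} A Sigma_hat^{-1/2}, minus one.
    gam = np.sort(np.linalg.eigvalsh(A * np.outer(s, s)) - 1.0)[::-1]
    return -N * float(np.sum(np.log1p(gam[m:]))), gam

rng = np.random.default_rng(4)
p, m, N = 6, 2, 100
Sigma_hat = np.diag(rng.uniform(0.5, 2.0, size=p))   # hypothetical error variances
Lam = rng.normal(size=(p, m))                        # hypothetical loadings

# Exact fit: A - Sigma_hat has rank m, so the p - m smallest roots vanish.
U_exact, _ = lr_criterion_from_roots(Sigma_hat + Lam @ Lam.T, Sigma_hat, m, N)

# Cross-check against the determinant form N[log|Sigma| + log|I + Gamma| - log|A|],
# with |I + Gamma| computed as the product of (1 + gamma_i) over the m largest roots.
E = rng.normal(size=(p, p))
A = Sigma_hat + Lam @ Lam.T + 0.05 * (E @ E.T)
U_roots, gam = lr_criterion_from_roots(A, Sigma_hat, m, N)
U_det = N * (np.linalg.slogdet(Sigma_hat)[1] + np.sum(np.log1p(gam[:m]))
             - np.linalg.slogdet(A)[1])
print(U_exact, U_roots, U_det)
```

The agreement of the last two numbers is purely algebraic; the interpretation of the value as a likelihood ratio criterion still requires that the error variances and loadings be the maximum likelihood estimates.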

We can give an intuitive interpretation of this test. $\hat\Sigma$ and $\hat\Lambda$ are found so that
$A - (\hat\Sigma + \hat\Lambda\hat\Lambda')$ is small in a statistical sense, or equivalently so $A - \hat\Sigma$ is approximately of
rank $m$. If $A - \hat\Sigma$ is approximately of rank $m$, the smallest $p - m$ roots of (8.6) should
be near zero. The criterion measures in a certain way the deviation of the smallest roots
from zero; since by (8.3) and (8.7) these roots sum to zero, expanding the logarithm in (8.9)
shows that $U_m$ is approximately $\frac{N}{2}\sum_{i=m+1}^{p} \gamma_i^2$.

This test was proposed by Lawley [14]. Bartlett [6] has suggested replacing the factor
$N$ in the criterion by $N - (2p + 11)/6 - 2m/3$. It might be noted that to justify
the statement that $U_m$ has an asymptotic $\chi^2$-distribution it is necessary to verify that
$\hat\Lambda$ and $\hat\Sigma$ have an asymptotic normal distribution (or an equivalent statement).

We can also use the likelihood ratio criterion if the identification is by elements of $\Lambda$
specified to be zero. Then

(8.10)  \4r\  =  \2+k&k'\  =  |2|.(J+ AMA'S-1! 

=  |2| •  |/  +  |  -  1 2 1  - 1  1  =  |2|.|r-|-|tf| , 

and  by  (10.18) 

(8.11)  tr  V-'A  -  p  =  tr  (^~M  -  I) 

=  tr  [tr^A  -#)  - 

=  tr  2~l(A  -*)-  tr  2~'kJ'2  -  tr  2kJ'AMk' 

=  0 


because  diag  (A  —  +)  =  0,  J'k  =  0,  and  diag  kJ'  —  0.  Thus  in  this  case 

(8.12)  Um  =  A[log  \2\  +  log  |I£|  -  log  |f  1  -  log  |i4|] . 


When the null hypothesis is true, $U_m$ is distributed as $\chi^2$ with number of degrees of
freedom equal to $p(p + 1)/2 - p - pm - m(m - 1)/2$ plus the number of 0's specified
in $\Lambda$ (at least $m(m - 1)$).

Bartlett [6], [8] has proposed another test procedure. Let $z_1 > z_2 > \cdots > z_p$ be the
roots of

(8.13)  $|R - zI| = 0$.

The criterion suggested is

(8.14)  $\left[ N - \dfrac{2p + 11}{6} - \dfrac{2m}{3} \right] \left\{ (p - m) \log \dfrac{\sum_{i=m+1}^{p} z_i}{p - m} - \sum_{i=m+1}^{p} \log z_i \right\}$.

This criterion is small if the last $p - m$ roots of $R$ are nearly equal. It is difficult to relate
this test to the factor analysis model. The test can be expected to be consistent if the
population correlation matrix is of the form $\sigma^2 I + \Lambda\Lambda'$; intuitively the test judges
whether $R - \hat\Lambda\hat\Lambda'$ is approximately proportional to $I$ (see [4]). Another difficulty with
this procedure is that even its asymptotic distribution is unknown.
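The criterion based on the roots of the correlation matrix is easy to evaluate. The following sketch assumes the form of the criterion as a multiplier $N - (2p + 11)/6 - 2m/3$ times $(p - m)$ times the logarithm of the mean of the last $p - m$ roots, minus the sum of their logarithms; the function name and the equicorrelation example are hypothetical.

```python
import numpy as np

def bartlett_root_criterion(R, m, N):
    """Criterion comparing the arithmetic and geometric means of the last p - m roots of R."""
    p = R.shape[0]
    z = np.sort(np.linalg.eigvalsh(R))[::-1]     # z_1 >= ... >= z_p
    tail = z[m:]
    factor = N - (2 * p + 11) / 6 - 2 * m / 3
    return float(factor * (tail.size * np.log(tail.mean()) - np.log(tail).sum()))

# Hypothetical correlation matrix with all correlations equal to rho:
# one large root and p - 1 equal small roots.
p, rho = 5, 0.4
R_equi = (1 - rho) * np.eye(p) + rho * np.ones((p, p))

crit_m1 = bartlett_root_criterion(R_equi, 1, 100)   # last p - 1 roots are equal
crit_m0 = bartlett_root_criterion(R_equi, 0, 100)   # all p roots, not equal
print(crit_m1, crit_m0)
```

With one factor removed the remaining roots are equal and the criterion vanishes up to rounding; with no factor removed it is clearly positive, since the arithmetic mean of unequal roots exceeds their geometric mean.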

There are other ways of deciding whether $A - \hat\Sigma - \hat\Lambda\hat\Lambda'$ is sufficiently small to
accept the hypothesis that $\Psi - \Sigma - \Lambda\Lambda'$ is zero. As was noted earlier, it is common practice
to apply the centroid method to the correlation matrix. If $\hat\Sigma$ and $\hat\Lambda$ are now the
estimates based on this method, one wants to decide whether the elements of $R - \hat\Sigma - \hat\Lambda\hat\Lambda'$
are sufficiently near zero. Frequently, rules of thumb are used, such as deciding the
elements are sufficiently near zero if each element is within .05 of being zero. It is obviously
difficult to investigate the theory of such rules.

8.2.  Determination of the number of factors (VI). In many cases the investigator does
not know the number of factors. He may not even be in the position of postulating a
specific value of $m$. In such situations the statistical problem is not one of testing the
hypothesis that $m$ is a given number, but rather of determining an appropriate number
of factors. The investigator wants to decide the smallest number of factors such that the
corresponding model fits the data.

What is desired is a multiple decision procedure. Such a procedure could be described
by a function $h(A)$, that takes on the values $0, 1, \ldots, M$, where $M$ is the maximum
number of factors that could be needed. One would like such a function that had some
desirable properties. Unfortunately, there is no such procedure for which we have an
adequate statistical theory, even an asymptotic theory.

In a situation such as this it is common practice to make a sequence of tests. In this
case one might test the hypothesis that $m = m_0$ against the alternative that $m > m_0$; if
this hypothesis is rejected, test $m = m_0 + 1$ against the alternative $m > m_0 + 1$, etc.
However, even if we know the significance level of each test separately, we do not know the
probabilities associated with the entire procedure; that is, if $m^*$ is the true number of
factors, we do not know (even asymptotically) what the probability is that our procedure
leads to the decision $m = m^*$. Perhaps all that can be said is that the probability of
saying $m > m^*$ ($\geq m_0$) is not greater than the significance level of the test of $m = m^*$.

Another kind of procedure that might be considered is another sequence of tests,
namely test hypotheses $m = m_0$ against the alternative $m = m_0 + 1$, if this is rejected
test $m = m_0 + 1$ against $m = m_0 + 2$, etc. In the case of likelihood ratio tests, this
would involve the criterion $U_{m_0} - U_{m_0+1}$, then $U_{m_0+1} - U_{m_0+2}$, etc. However, here we
do not know the asymptotic distribution of the criterion.

Using a sequence of likelihood ratio tests is computationally difficult, for at each
stage one must compute $\hat\Sigma$ and $\hat\Lambda$ for a value of $m$ and at the next stage one must
compute another $\hat\Sigma$ and $\hat\Lambda$.

In practice ad hoc rules are frequently used in dealing with $R$, such as using a rule of
thumb to determine $\hat\Sigma$, then using the centroid method to estimate successively columns
of $\Lambda$ until the elements of $R - \hat\Sigma - \hat\Lambda\hat\Lambda'$ are sufficiently small, say within .05 of being
zero. We do not know the statistical properties of such procedures. It has also been
proposed that after following such an ad hoc procedure, the investigator then use the
likelihood ratio test to test the hypothesis that $m$ is equal to the number determined by the
ad hoc procedure. Again we do not have any statistical theory for the procedure.

8.3.  Tests of hypotheses (VII). There are many hypotheses about $\Lambda$ and $\Sigma$ that an
investigator might consider. For example, he might be interested in whether a specified
$\lambda_{i\alpha}$ is zero, that is, whether a given factor does not enter a given test. In this connection,
he might also want a confidence interval for a specified $\lambda_{i\alpha}$. In principle, it is possible to
give a large sample procedure in such cases if one has a consistent estimate which is
asymptotically normally distributed, if one knows the theoretical asymptotic variance
(in terms of $\Lambda$ and $\Sigma$), and if one has consistent estimates of the parameters involved in
the asymptotic variance. Unfortunately, in practice this is extremely difficult because
the asymptotic variances are complicated functions of the parameters. It might be noted
that when $\Lambda$ is identified by arbitrary mathematical conditions (such as diagonality of
$\Lambda'\Sigma^{-1}\Lambda$), these hypotheses do not have much significance.

Another  hypothesis  that  might  be  of  interest  is  the  hypothesis  that  all  factor  loadings 
for  a  specified  test  are  zero.  In  our  model  this  is  equivalent  to  the  hypothesis  that  the 
specified  test  score  is  independent  of  the  other  test  scores.  Such  a  hypothesis  can  be 
tested  by  using  the  multiple  correlation  coefficient  between  the  specified  test  and  the 
other  tests. 

Another hypothesis is whether a given tetrad difference is zero. This is relevant to the
question of whether some four tests meet the conditions for a single factor model.
Although a fair amount of work has been done on this problem, we shall not treat it
because we are interested here in problems for a general number of factors.

9.  Problems  of  statistical  inference:  Estimation  of  factor  scores  (VIII) 

If one postulates a model where the factor scores are not random, then the observed
test score vector for the $\alpha$th person is an observation on $\Lambda f_\alpha + \mu + U$, where $f_\alpha$ is a
vector of parameters. Given a sample of test score vectors, one from each of $N$ individuals,
one can ask for estimates of the $N$ factor score vectors, $f_1, \ldots, f_N$. If one postulates a
model where the factor scores are random, one can consider the conditional distribution
of the set of test score vectors given that the factor score vectors are fixed vectors,
$f_1, \ldots, f_N$. Then this conditional distribution is the same as the distribution with
nonrandom factor score vectors.

As was indicated in section 7.6, we cannot apply the method of maximum likelihood
to the problem of simultaneous estimation of $\Sigma$, $\Lambda$, $\mu$, and $f_1, \ldots, f_N$. A reasonable
procedure seems to be to estimate $\Sigma$, $\Lambda$ and $\mu$ and then consider estimating $f_1, \ldots, f_N$
assuming that $\Sigma$, $\Lambda$ and $\mu$ are known and are equal to the estimates $\hat\Sigma$, $\hat\Lambda$ and $\hat\mu$. Under the
assumption of normality, we can consider the likelihood of $x_1, \ldots, x_N$ (given $\Lambda$, $\Sigma$,
and $\mu$) and maximize it with respect to $f_1, \ldots, f_N$. This is equivalent to minimizing

(9.1)  $\sum_{i=1}^{p} \dfrac{(x_{i\alpha} - \mu_i - \sum_{\nu} \lambda_{i\nu} f_{\nu\alpha})^2}{\sigma_{ii}}$.

This amounts to a weighted least-squares problem in which $x_{i\alpha} - \mu_i$ are the dependent
variates, $\lambda_{i\nu}$ are the independent variates and $f_{\nu\alpha}$ are the unknown coefficients [5]. The
estimated vector $\hat{f}_\alpha$ is

(9.2)  $\hat{f}_\alpha = (\Lambda'\Sigma^{-1}\Lambda)^{-1}\Lambda'\Sigma^{-1}(x_\alpha - \mu)$.

This  estimate  has  the  usual  properties  of  a  least-squares  estimate.  It  is  unbiased  and  each 
component  of  the  estimate  has  minimum  variance  of  all  linear  unbiased  estimates.  It 
will  be  observed  that  the  coefficients  of  the  estimates  depend  on  A  and  2;  these  would 
not  change  if  a  selection  on  the  basis  of  factor  scores  was  made. 

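Another approach has been suggested by Thomson [21]. If $\mathcal{E}ff' = I$, the covariance
matrix of $X$ and $f$ is

(9.3)  $\begin{pmatrix} \Sigma + \Lambda\Lambda' & \Lambda \\ \Lambda' & I \end{pmatrix}$.

Then the regression of $f$ on $X$ is $\Lambda'(\Sigma + \Lambda\Lambda')^{-1}X = (I + \Gamma)^{-1}\Lambda'\Sigma^{-1}X$. The estimate
of $f_\alpha$ is

(9.4)  $\hat f_\alpha = (I + \Gamma)^{-1}\Lambda'\Sigma^{-1}(x_\alpha - \bar x)$.

The $\nu$th component of this estimate is $\gamma_{\nu\nu}/(1 + \gamma_{\nu\nu})$ times the $\nu$th component of (9.2)
when $\Gamma$ is diagonal.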
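The two forms of the regression estimate and its relation to (9.2) can be checked numerically. The sketch below assumes NumPy and uses hypothetical simulated values; the loadings are constructed so that $\Gamma = \Lambda'\Sigma^{-1}\Lambda$ is diagonal, and the same deviation vector `d` is used in both estimates so that they are directly comparable.

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 5, 2
sigma2 = rng.uniform(0.5, 2.0, size=p)          # diagonal of Sigma
gam = np.array([3.0, 1.5])                      # chosen diagonal of Gamma

# Orthonormal columns, rescaled so that Lambda' Sigma^{-1} Lambda = diag(gam).
Q, _ = np.linalg.qr(rng.normal(size=(p, m)))
Lam = np.sqrt(sigma2)[:, None] * Q * np.sqrt(gam)

Sigma_inv = np.diag(1.0 / sigma2)
Gamma = Lam.T @ Sigma_inv @ Lam
d = rng.normal(size=p)                          # a deviation vector x_alpha - x_bar

f_wls = np.linalg.solve(Gamma, Lam.T @ Sigma_inv @ d)               # form of (9.2)
f_reg = np.linalg.solve(np.eye(m) + Gamma, Lam.T @ Sigma_inv @ d)   # form of (9.4)
# Direct regression form Lambda' (Sigma + Lambda Lambda')^{-1} d.
f_reg_direct = Lam.T @ np.linalg.solve(np.diag(sigma2) + Lam @ Lam.T, d)
shrink = np.diag(Gamma) / (1.0 + np.diag(Gamma))                    # gamma / (1 + gamma)
```

In this construction `f_reg` agrees with `f_reg_direct`, and each component of `f_reg` is the corresponding component of `f_wls` multiplied by `shrink`.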

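One might also ask for an estimate such that $\frac{1}{N}\sum_\alpha \hat f_\alpha \hat f_\alpha' = I$. Now let us find
estimates that have this property by minimizing $\sum_\alpha (x_\alpha - \bar x - \Lambda f_\alpha)'\Sigma^{-1}(x_\alpha - \bar x - \Lambda f_\alpha)$
under the restrictions that $\sum_\alpha f_\alpha f_\alpha' = NI$. We minimize

(9.5)  $\sum_{\alpha=1}^{N} (x_\alpha - \bar x - \Lambda f_\alpha)'\Sigma^{-1}(x_\alpha - \bar x - \Lambda f_\alpha) + \mathrm{tr}\,\Theta\left(\sum_{\alpha=1}^{N} f_\alpha f_\alpha' - NI\right)$,

where $\Theta$ is a symmetric matrix of Lagrange multipliers. Then

(9.6)  $\hat f_\alpha = (\Lambda'\Sigma^{-1}\Lambda + \Theta)^{-1}\Lambda'\Sigma^{-1}(x_\alpha - \bar x)$,

where $\Lambda'\Sigma^{-1}\Lambda + \Theta$ is the symmetric square root of $\Lambda'\Sigma^{-1}A\Sigma^{-1}\Lambda$. If $\Lambda$ and $\Sigma$ are the
estimates by the method of section 7.2 then $\Lambda'\Sigma^{-1}A\Sigma^{-1}\Lambda = \Gamma(I + \Gamma)$ and (9.6) is

(9.7)  $\hat f_\alpha = [\Gamma(I + \Gamma)]^{-1/2}\Lambda'\Sigma^{-1}(x_\alpha - \bar x)$.

If we want $\sum_\alpha \hat f_\alpha \hat f_\alpha' = NM$, where $M$ is specified, then $\Lambda'\Sigma^{-1}\Lambda + \Theta$ must satisfy

(9.8)  $(\Lambda'\Sigma^{-1}\Lambda + \Theta)M(\Lambda'\Sigma^{-1}\Lambda + \Theta) = \Lambda'\Sigma^{-1}A\Sigma^{-1}\Lambda$.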
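The construction in (9.6) can be illustrated with a short sketch. It assumes NumPy, uses hypothetical simulated data, and takes $\Lambda$ and $\Sigma$ as given (here the simulating values, not the section 7.2 estimates); the helper `sym_sqrt` is an illustrative name. The symmetric square root is computed from an eigendecomposition.

```python
import numpy as np

def sym_sqrt(S):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(w)) @ V.T

rng = np.random.default_rng(2)
N, p, m = 200, 6, 2
Lam = rng.normal(size=(p, m))
sigma2 = rng.uniform(0.5, 1.5, size=p)           # diagonal of Sigma
X = rng.normal(size=(N, m)) @ Lam.T + rng.normal(size=(N, p)) * np.sqrt(sigma2)

D = X - X.mean(axis=0)                           # rows are x_alpha - x_bar
A = D.T @ D / N                                  # sample covariance, divisor N
W = Lam / sigma2[:, None]                        # Sigma^{-1} Lambda
B = sym_sqrt(W.T @ A @ W)                        # plays the role of Lambda' Sigma^{-1} Lambda + Theta
F_hat = np.linalg.solve(B, W.T @ D.T).T          # rows are the scores of (9.6)
second_moment = F_hat.T @ F_hat / N              # should equal the identity
```

By construction the sample second-moment matrix of these scores is the identity, whatever values of $\Lambda$ and $\Sigma$ are supplied, provided $\Lambda'\Sigma^{-1}A\Sigma^{-1}\Lambda$ is nonsingular.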

We note that if one assumes $\Sigma = \sigma^2 I$ and that the factor vectors are nonrandom, then
$\mathcal{E}X_{i\alpha} - \mu_i = \sum_\nu \lambda_{i\nu} f_{\nu\alpha}$ and the variance of $X_{i\alpha}$ is $\sigma^2$. Then the role of tests and
individuals can be interchanged. Whittle has shown that under suitable identification conditions
(including $\sum_\alpha f_{\nu\alpha} f_{\mu\alpha} = 0$ for $\nu \ne \mu$) the estimates of $f_{\nu\alpha}$ involve the principal
components of $\sum_i (x_{i\alpha} - \bar x_i)(x_{i\beta} - \bar x_i)$.

PART  II.  PROOFS  OF  SOME  NEW  RESULTS 

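10.  Maximum likelihood estimates for random factor scores when $\Lambda$ is identified
by specified zero elements

Theorem 10.1. Let $x_1, \ldots, x_N$ be $N$ observations from $N(\mu, \Psi)$, where $\Psi = \Lambda M \Lambda' + \Sigma$.
Let $m_{\alpha\alpha} = 1$ and $\lambda_{i\alpha} = 0$, $i = i(1, \alpha), \ldots, i(p_\alpha, \alpha)$, $\alpha = 1, \ldots, m$. Let
$NA = \sum_{\alpha=1}^{N} (x_\alpha - \bar x)(x_\alpha - \bar x)'$. The maximum likelihood estimate of $\mu$ is $\hat\mu = \bar x$ and the
maximum likelihood estimates of $\Lambda$, $M$ and $\Sigma$ are given by

(10.1)  $\mathrm{diag}\,(A - \hat\Sigma - \hat\Lambda\hat M\hat\Lambda') = 0$,

(10.2)  $J'\hat\Lambda = 0$,

(10.3)  $\hat\Lambda'\hat\Sigma^{-1}A - \hat\Lambda' - \hat\Lambda'\hat\Sigma^{-1}\hat\Lambda\hat M\hat\Lambda' = (\hat M^{-1} + \hat\Lambda'\hat\Sigma^{-1}\hat\Lambda)J'\hat\Sigma$,

where $j_{i\alpha} = 0$, $i \ne i(1, \alpha), \ldots, i(p_\alpha, \alpha)$, $\alpha = 1, \ldots, m$.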

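Proof. The logarithm of the likelihood function of $\Psi$ given $\hat\mu = \bar x$, and divided by
$N/2$, is

(10.4)  $\phi = -p \log 2\pi - \log|\Psi| - \mathrm{tr}\,\Psi^{-1}A$.

The partial derivative of $\phi$ with respect to $\psi_{gh}$ (where $\psi_{hg}$ is not assumed to be $\psi_{gh}$) is
the $g, h$th element of

(10.5)  $\Psi^{-1}A\Psi^{-1} - \Psi^{-1}$.

Then

(10.6)  $\frac{\partial\phi}{\partial\sigma_{ii}} = \sum_{g,h} \frac{\partial\phi}{\partial\psi_{gh}} \frac{\partial\psi_{gh}}{\partial\sigma_{ii}} = \sum_{g',h'} \psi^{ig'}a_{g'h'}\psi^{h'i} - \psi^{ii}$,

(10.7)  $\frac{\partial\phi}{\partial m_{\alpha\beta}} = \sum_{g,h} \frac{\partial\phi}{\partial\psi_{gh}} \frac{\partial\psi_{gh}}{\partial m_{\alpha\beta}} = 2\sum_{g,h} \lambda_{g\alpha}\left(\sum_{g',h'} \psi^{gg'}a_{g'h'}\psi^{h'h} - \psi^{gh}\right)\lambda_{h\beta}$,  $\alpha \ne \beta$,

(10.8)  $\frac{\partial\phi}{\partial\lambda_{i\alpha}} = \sum_{g,h} \frac{\partial\phi}{\partial\psi_{gh}} \frac{\partial\psi_{gh}}{\partial\lambda_{i\alpha}} = 2\sum_{\beta,g} m_{\alpha\beta}\lambda_{g\beta}\left(\sum_{g',h'} \psi^{gg'}a_{g'h'}\psi^{h'i} - \psi^{gi}\right)$,  $i \ne i(1, \alpha), \ldots, i(p_\alpha, \alpha)$,

where $\Psi^{-1} = (\psi^{gh})$. Let

(10.9)  $\hat M\hat\Lambda'(\hat\Psi^{-1}A\hat\Psi^{-1} - \hat\Psi^{-1}) = J'$.

When we set (10.6), (10.7), and (10.8) equal to 0, we obtain

(10.10)  $\mathrm{diag}\,(\hat\Psi^{-1}A\hat\Psi^{-1} - \hat\Psi^{-1}) = 0$,

(10.11)  $\hat\Lambda'(\hat\Psi^{-1}A\hat\Psi^{-1} - \hat\Psi^{-1})\hat\Lambda = D$,

(10.12)  $j_{i\alpha} = 0$,  $i \ne i(1, \alpha), \ldots, i(p_\alpha, \alpha)$,  $\alpha = 1, \ldots, m$,

where $D$ is diagonal. The last set of equations states that $J$ has 0's where $\Lambda$ is not
specified to have 0's.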

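Now let us simplify these equations; in particular, we want to express
$\hat\Psi^{-1}A\hat\Psi^{-1} - \hat\Psi^{-1}$ differently. Multiplication of (10.11) on the left by $\hat M$ and (10.9) on the right by $\hat\Lambda$
gives

(10.13)  $\hat M D = J'\hat\Lambda$.

The diagonal elements of $J'\hat\Lambda$ are

(10.14)  $\sum_i j_{i\alpha}\hat\lambda_{i\alpha} = 0$

because either $j_{i\alpha} = 0$ or $\hat\lambda_{i\alpha} = 0$. Thus the diagonal elements of $\hat M D$ are

(10.15)  $\hat m_{\alpha\alpha}d_{\alpha\alpha} = d_{\alpha\alpha} = 0$,

and therefore $D = 0 = J'\hat\Lambda$. Multiplication of (10.9) on the right by $\hat\Psi$ gives

(10.16)  $\hat M\hat\Lambda'(\hat\Psi^{-1}A - I) = J'\hat\Psi = J'(\hat\Sigma + \hat\Lambda\hat M\hat\Lambda') = J'\hat\Sigma$

because $J'\hat\Lambda = 0$. Then

(10.17)  $(I + \hat M\hat\Gamma)J'\hat\Sigma = (I + \hat M\hat\Lambda'\hat\Sigma^{-1}\hat\Lambda)\hat M\hat\Lambda'(\hat\Psi^{-1}A - I)$
         $= \hat M\hat\Lambda'\hat\Sigma^{-1}(\hat\Sigma + \hat\Lambda\hat M\hat\Lambda')(\hat\Psi^{-1}A - I)$
         $= \hat M\hat\Lambda'\hat\Sigma^{-1}(A - \hat\Psi)$.

From this we derive (10.3). Now we can write

(10.18)  $A - \hat\Psi = \hat\Psi(\hat\Psi^{-1}A - I)$
         $= (\hat\Sigma + \hat\Lambda\hat M\hat\Lambda')(\hat\Psi^{-1}A - I)$
         $= \hat\Sigma(\hat\Psi^{-1}A - I) + \hat\Lambda J'\hat\Sigma$

by (10.16). This can be written as

(10.19)  $A - \hat\Psi = \hat\Sigma(\hat\Psi^{-1}A\hat\Psi^{-1} - \hat\Psi^{-1})\hat\Psi + \hat\Lambda J'\hat\Sigma$
         $= \hat\Sigma(\hat\Psi^{-1}A\hat\Psi^{-1} - \hat\Psi^{-1})\hat\Sigma + \hat\Sigma J\hat\Lambda' + \hat\Lambda J'\hat\Sigma$.

The diagonal elements of $\hat\Sigma J\hat\Lambda'$ and $\hat\Lambda J'\hat\Sigma$ are 0 because $\hat\sigma_{ii}\sum_\alpha \hat\lambda_{i\alpha}j_{i\alpha} = 0$ since $j_{i\alpha}$ is 0
if $\hat\lambda_{i\alpha}$ is not 0. The diagonal elements of the first term on the right are 0 by (10.10),
since $\hat\Sigma$ is diagonal. Then (10.19) implies (10.1).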

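We can find another set of equations for the estimates by eliminating $J$ from (10.2)
and (10.3). Multiplication of (10.3) on the right by $\hat\Sigma^{-1}\hat\Lambda$ gives

(10.20)  $\hat\Lambda'\hat\Sigma^{-1}A\hat\Sigma^{-1}\hat\Lambda - \hat\Gamma - \hat\Gamma\hat M\hat\Gamma = 0$

because $J'\hat\Lambda = 0$. Let $\hat K = \hat\Lambda'\hat\Sigma^{-1}A\hat\Sigma^{-1}\hat\Lambda$. Then

(10.21)  $(I + \hat M\hat\Gamma) = \hat\Gamma^{-1}\hat K$

and (10.17) gives

(10.22)  $J'\hat\Sigma = \hat K^{-1}\hat\Gamma\hat M\hat\Lambda'\hat\Sigma^{-1}(A - \hat\Psi)$
         $= \hat K^{-1}\hat\Gamma\hat M\hat\Lambda'\hat\Sigma^{-1}(A - \hat\Sigma - \hat\Lambda\hat M\hat\Lambda')$
         $= \hat K^{-1}\hat\Gamma\hat M(\hat\Lambda'\hat\Sigma^{-1}A - \hat\Lambda' - \hat\Gamma\hat M\hat\Lambda')$
         $= \hat K^{-1}\hat\Gamma\hat M(\hat\Lambda'\hat\Sigma^{-1}A - (I + \hat\Gamma\hat M)\hat\Lambda')$
         $= \hat K^{-1}\hat\Gamma\hat M(\hat\Lambda'\hat\Sigma^{-1}A - \hat K\hat\Gamma^{-1}\hat\Lambda')$.

From (10.21) we also have $\hat M\hat\Gamma = \hat\Gamma^{-1}\hat K - I$, $\hat K^{-1}\hat\Gamma\hat M = \hat\Gamma^{-1} - \hat K^{-1}$, and (10.22) is

(10.23)  $J'\hat\Sigma = (\hat\Gamma^{-1} - \hat K^{-1})(\hat\Lambda'\hat\Sigma^{-1}A - \hat K\hat\Gamma^{-1}\hat\Lambda')$.

Then the estimates are defined by (10.1), (10.21) and the equations where the elements
of (10.23) are set equal to zero if the corresponding element of $\hat\Lambda'$ is not assumed zero.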
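The algebra leading to (10.21) and (10.23) can be sanity-checked in the idealized case where the sample matrix $A$ coincides with $\Psi$ and the true parameters are used as the estimates. The following sketch assumes NumPy and hypothetical parameter values; it does not impose the specified zero elements and does not solve the likelihood equations, it only confirms that (10.21) holds and that the right-hand side of (10.23) vanishes in that case.

```python
import numpy as np

rng = np.random.default_rng(3)
p, m = 6, 2
Lam = rng.normal(size=(p, m))
M = np.array([[1.0, 0.3], [0.3, 1.0]])          # factor covariance with m_aa = 1
Sigma = np.diag(rng.uniform(0.5, 1.5, size=p))
Psi = Lam @ M @ Lam.T + Sigma
A = Psi                                          # idealized "sample" covariance matrix

Si = np.linalg.inv(Sigma)
Gamma = Lam.T @ Si @ Lam                         # Lambda' Sigma^{-1} Lambda
K = Lam.T @ Si @ A @ Si @ Lam                    # Lambda' Sigma^{-1} A Sigma^{-1} Lambda

lhs_10_21 = np.eye(m) + M @ Gamma                # I + M Gamma
rhs_10_21 = np.linalg.solve(Gamma, K)            # Gamma^{-1} K

# Right-hand side of (10.23); it is zero here because A equals Psi.
Jt_Sigma = (np.linalg.inv(Gamma) - np.linalg.inv(K)) @ (
    Lam.T @ Si @ A - K @ np.linalg.solve(Gamma, Lam.T)
)
```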

Let  us  consider  a  special  case  of  m  =  2.  We  order  the  rows  of  A  so  that 


(10.24) 

A=(o  *)• 

Then 

II 

Co 

Oj^ 

(10.25) 

We  can  write  (10.2)  as 

(10.26) 

o 

II 

^  ° 

o  ^ 

II 

<<j 

and  (10.3)  as 

(10-27)  (^ 

(A’  ON  (A'  0  N  a_1  ( A  0N/1  thn\(A'  ON 

V0  87  Vo  87  Vo  8/  V*,i  1  7  Vo  87 

“[**<  S')^(o  s)](° 

If  we  let 

(10.28) 

t  =  \ 

0  \  .  ( Aw  A i2\ 

\0  tj'  \An  Aj' 
then we have

f&'lT'Au  &'±TlA 


/ d  Si  .4 n  d'2,1  ^4 12\  /ft1  0  \  /d^i  d  ’  df  d'^i  dmi2^,\ 

(10.29)  g,2-i AJ  Vo  67  Vg's-'g^a'  S'^r'S -S'  / 

*"  (rraK+^O^l 


i-rftj. 


c't h 


(r^w+m) -t=w.3^1 


These  can  be  written 


(10.30) 


d'Sr‘An-d'~  (i nT'd)d'=  ~T=%r  <V. 
d't-'An-  (d[tT'd) *,,£'  =  (1^+ d't-\ 
b't-'An  -  t'l->uud'  =  (1^+  6'2r‘0  «v , 

SVi„-6'-8V5«'=  -i^fr  <*'V- 


11.  Estimates  for  nonrandom  factor  scores  when  AA'  is  unrestricted 
Here

(11.1)  $x_\alpha = \Lambda f_\alpha + U_\alpha + \mu$,  $\alpha = 1, \ldots, N$,

(11.2)  $\mathcal{E}x_\alpha = \Lambda f_\alpha + \mu$.

We assume

(11.3)  $\sum_{\alpha=1}^{N} f_\alpha = 0$,

(11.4)  $\frac{1}{N-1}\sum_{\alpha=1}^{N} f_\alpha f_\alpha' = M$.

Let

(11.5)  $A = \frac{1}{N-1}\sum_{\alpha=1}^{N} (x_\alpha - \bar x)(x_\alpha - \bar x)'$.

In (11.4) and (11.5) we have altered previous definitions by replacing $N$ by $N - 1$.
Then $(N - 1)A$ has the noncentral Wishart distribution with covariance matrix $\Sigma$,
means matrix $\Lambda M\Lambda'$, and $N - 1 = n$ degrees of freedom. Let $k_1, \ldots, k_m$ be the nonzero
roots of

(11.6)  $|\Lambda M\Lambda' - k\Sigma A^{-1}\Sigma| = 0$.

Then [3] the likelihood function of $A$ can be written²

(11.7)  $C\,|\Sigma|^{-n/2}\,|A|^{(n-p-1)/2}\,e^{-\frac{n}{2}\mathrm{tr}\,\Sigma^{-1}A - \frac{n}{2}\mathrm{tr}\,M\Lambda'\Sigma^{-1}\Lambda}\int |I - ZZ'|^{(n-2m-1)/2}\,e^{n\,\mathrm{tr}\,K^{1/2}Z}\,dZ$,

where the integration of the $m \times m$ elements of $Z$ is over the range $I - ZZ'$ positive
definite and

(11.8)  $K^{1/2} = \mathrm{diag}\,(\sqrt{k_1}, \sqrt{k_2}, \ldots, \sqrt{k_m})$.

² In [3] the roots should have been defined by the equation $|T - \lambda\Sigma A^{-1}\Sigma| = 0$.


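We now take $M = I$. The indeterminacy remaining in $\Lambda$ we remove for this particular
sample by requiring that $\hat\Lambda'\hat\Sigma^{-1}A\hat\Sigma^{-1}\hat\Lambda$ be diagonal. Then

(11.9)  $\hat K = \hat\Lambda'\hat\Sigma^{-1}A\hat\Sigma^{-1}\hat\Lambda$.

If

(11.10)  $\tfrac{1}{2}nF(K) = \log\int |I - ZZ'|^{(n-2m-1)/2}\,e^{n\,\mathrm{tr}\,K^{1/2}Z}\,dZ$,

then

(11.11)  $\frac{2}{n}\log L = \frac{2}{n}\log C + \log|\Sigma^{-1}| + \frac{n-p-1}{n}\log|A| - \mathrm{tr}\,\Sigma^{-1}A - \mathrm{tr}\,\Lambda'\Sigma^{-1}\Lambda + F(K)$.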

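The partial derivatives with respect to $\sigma^{ii}$, the elements of $\Sigma^{-1}$, and $\lambda_{i\alpha}$, the elements
of $\Lambda$, are

(11.12)  $\frac{\partial\,\frac{2}{n}\log L}{\partial\sigma^{ii}} = \sigma_{ii} - a_{ii} - \sum_\alpha \lambda_{i\alpha}^2 + 2\sum_\alpha F^*_\alpha(K)\,\lambda_{i\alpha}\sum_j a_{ij}\sigma^{jj}\lambda_{j\alpha}$,

(11.13)  $\frac{\partial\,\frac{2}{n}\log L}{\partial\lambda_{i\alpha}} = -2\sigma^{ii}\lambda_{i\alpha} + 2F^*_\alpha(K)\,\sigma^{ii}\sum_j a_{ij}\sigma^{jj}\lambda_{j\alpha}$.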

We set these derivatives equal to 0 to define the maximum likelihood estimates. These
derivatives set equal to 0 give

(11.14)  $\mathrm{diag}\,\hat\Sigma = \mathrm{diag}\,(A + \hat\Lambda\hat\Lambda' - 2\hat\Lambda F^*\hat\Lambda'\hat\Sigma^{-1}A)$,

(11.15)  $\hat\Sigma^{-1}\hat\Lambda = \hat\Sigma^{-1}A\hat\Sigma^{-1}\hat\Lambda F^*$,

where $F^*$ is the diagonal matrix composed of the partial derivatives of $F(K)$. When we
multiply (11.15) on the left by $\hat\Lambda'$, we obtain

(11.16)  $\hat\Gamma = \hat K F^*$,

which shows that $\hat\Gamma$ is diagonal. Then (11.15) is

(11.17)  $\hat\Lambda\hat\Gamma^{-1}\hat K = A\hat\Sigma^{-1}\hat\Lambda$.

From (11.14) and (11.15) we obtain

(11.18)  $\mathrm{diag}\,\hat\Sigma = \mathrm{diag}\,(A + \hat\Lambda\hat\Lambda' - 2\hat\Lambda\hat\Lambda') = \mathrm{diag}\,(A - \hat\Lambda\hat\Lambda')$.

The estimates are then defined by (11.9), (11.16), (11.17), (11.18), the requirement that
$\hat K$ be diagonal and the definition of the diagonal elements of the diagonal matrix $F^*$ as
the partial derivatives of (11.10) with respect to the diagonal elements of $K = \hat K$.

Equation (11.17) indicates that the diagonal elements of $(F^*)^{-1} = \hat\Gamma^{-1}\hat K$ must be $m$
roots of

(11.19)  $|A - \theta\hat\Sigma| = 0$,

whereas in section 7.2 the $m$ roots are the diagonal elements of $I + \hat\Gamma$. $F^*$ is a
complicated function of $\hat K = \hat\Lambda'\hat\Sigma^{-1}A\hat\Sigma^{-1}\hat\Lambda$.

We can show, however, that the estimates given in section 7.2 differ from the solution
of the above equations (for a given $A$) by amounts smaller than order $1/\sqrt{n}$. To show
this we make an asymptotic evaluation of $F^*$.

Theorem 11.1. For a given positive definite diagonal matrix $H$

(11.20)  $\lim_{n\to\infty} \sqrt{n}\left[\frac{\partial F(H)}{\partial h_\alpha} - \frac{-1 + \sqrt{1 + 4h_\alpha}}{2h_\alpha}\right] = 0$

uniformly in $H$ for $H$ in a bounded set, where $h_\alpha$ is the $\alpha$th diagonal element of $H$ and $F(H)$
is defined by (11.10).

The proof of this theorem is too complicated to give here. Using this result in (11.16)
we have

(11.21)  $\sqrt{n}\,\{\hat\Gamma - [-I + (I + 4\hat K)^{1/2}]/2\} \to 0$

uniformly for $\hat K$ in a bounded set. Now if we replace (11.15) by

(11.22)  $\hat\Gamma = [-I + (I + 4\hat K)^{1/2}]/2$

the solution of the equations will differ from the previous solution by amounts smaller
than order $1/\sqrt{n}$ (because the solutions are continuous). However, (11.22) is equivalent
to $\hat K = \hat\Gamma(I + \hat\Gamma)$ and then the solution is the one of section 7.2. The errors are uniformly
of order $o(1/\sqrt{n})$ for $A$ in a bounded set and for large enough $n$ the probability that $A$ is
in a set including $\Psi$ is arbitrarily near one. Hence

Theorem 11.2. If $A$ converges stochastically to a positive definite matrix, then $\sqrt{n}$ times
the difference between the estimates of $\Lambda$ and $\Sigma$ defined in this section and the estimates
defined in section 7.2 converges stochastically to 0.
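The equivalence between (11.22) and $\hat K = \hat\Gamma(I + \hat\Gamma)$ is elementary, since both matrices are diagonal, and can be checked element by element. The sketch below assumes NumPy and uses hypothetical diagonal values for $\hat K$; it also shows that the limiting value of the diagonal elements of $(F^*)^{-1}$ is $1 + \gamma$, in agreement with the roots $I + \hat\Gamma$ of section 7.2.

```python
import numpy as np

k = np.array([0.2, 1.0, 7.5])                    # hypothetical diagonal elements of K-hat
gamma = (-1.0 + np.sqrt(1.0 + 4.0 * k)) / 2.0    # (11.22), element by element
k_back = gamma * (1.0 + gamma)                   # diagonal of Gamma (I + Gamma)

# Limiting diagonal of F* implied by (11.16) and (11.20): gamma / k.
f_star_limit = gamma / k
roots = 1.0 / f_star_limit                       # limiting diagonal of (F*)^{-1}
```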


12.  Asymptotic  normality  of  estimates 

In this section we shall prove that $\sqrt{N}(\hat\Lambda - \Lambda)$ and $\sqrt{N}(\hat\Sigma - \Sigma)$ defined by (7.4)
to (7.7) are asymptotically normally distributed when $A$ converges stochastically to
$\Psi = \Sigma + \Lambda\Lambda'$ and $\sqrt{N}(A - \Psi)$ is asymptotically normally distributed. In particular,
this is true when the observations $x_1, x_2, \ldots$ are drawn from $N(\mu, \Psi)$. Let

(12.1)  $\Phi = \Sigma - \Lambda(\Lambda'\Sigma^{-1}\Lambda)^{-1}\Lambda'$.

Theorem 12.1. If $|\phi_{ij}^2| \ne 0$, where $\Phi = (\phi_{ij})$ is defined by (12.1), if $\Lambda$ and $\Sigma$ are identified
by the condition that $\Lambda'\Sigma^{-1}\Lambda$ is diagonal and the diagonal elements are different and ordered,
if $A$ converges stochastically to $\Psi$, and if $\sqrt{N}(A - \Psi)$ has a limiting normal distribution,
then $\sqrt{N}(\hat\Lambda - \Lambda)$, $\sqrt{N}(\hat\Sigma - \Sigma)$ defined by (7.4) to (7.7) have a limiting normal
distribution.

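Proof: First we show that $\hat\Lambda$ converges stochastically to $\Lambda$ and $\hat\Sigma$ converges
stochastically to $\Sigma$. The estimates $\hat\Lambda$, $\hat\Sigma$ are defined as the matrices satisfying $\Lambda^{*\prime}\Sigma^{*-1}\Lambda^*$ being
diagonal and maximizing

(12.2)  $f(A, \Sigma^*, \Lambda^*) = \log L(A, \Sigma^*, \Lambda^*) = -p\log 2\pi - \log|\Sigma^* + \Lambda^*\Lambda^{*\prime}| - \mathrm{tr}\,A(\Sigma^* + \Lambda^*\Lambda^{*\prime})^{-1}$.

Now

(12.3)  $f(A, \Sigma^*, \Lambda^*) \to f(\Lambda\Lambda' + \Sigma, \Sigma^*, \Lambda^*)$

uniformly in probability in a neighborhood of $\Sigma$, $\Lambda$ and $f(\Lambda\Lambda' + \Sigma, \Sigma^*, \Lambda^*)$ has a unique
maximum at $\Sigma^* = \Sigma$, $\Lambda^* = \Lambda$. Because the functions are continuous, the $\Sigma^*$, $\Lambda^*$
maximizing $f(A, \Sigma^*, \Lambda^*)$ must converge stochastically to $\Sigma$, $\Lambda$.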

To prove the theorem we need only to prove that $\hat\Lambda$ and $\hat\Sigma$ are functions of $A$ that
have continuous first derivatives in a neighborhood of $A = \Psi$. Since the equations
defining $\hat\Lambda$ and $\hat\Sigma$ are rational functions of $A$, $\hat\Sigma$ and $\hat\Lambda$ set equal to zero, they are polynomial
equations; the derivatives will be continuous unless they become infinite. The remainder
of the proof is to show that they do not become infinite (at $A = \Psi$).

$\hat\Lambda$ and $\hat\Sigma$ are defined implicitly as functions of $A$, say by

(12.4)  $H(\hat\lambda, \hat\sigma, a) = 0$,

where $\hat\lambda$ is $\hat\Lambda$ arranged in a vector, $\hat\sigma$ is $\hat\Sigma$ arranged in a vector and $a$ is $A$ arranged in a
vector.

The solution to (12.4) is $\hat\lambda = \hat\lambda(a)$, $\hat\sigma = \hat\sigma(a)$. Then

(12.5)  $H_\lambda[\hat\lambda(a), \hat\sigma(a), a]\,\hat\lambda_a(a) + H_\sigma[\hat\lambda(a), \hat\sigma(a), a]\,\hat\sigma_a(a) + H_a[\hat\lambda(a), \hat\sigma(a), a] = 0$,

where $H_\lambda$ is the matrix of partial derivatives of the components of $H(\lambda, \sigma, a)$ with
respect to the components of $\lambda$, $\hat\lambda_a(a)$ is the matrix of partial derivatives of the components
of $\hat\lambda(a)$ with respect to the components of $a$, etc. We need to show that

(12.6)  $H_\lambda(\lambda, \sigma, \psi)\,\hat\lambda_a(\psi) + H_\sigma(\lambda, \sigma, \psi)\,\hat\sigma_a(\psi) + H_a(\lambda, \sigma, \psi) = 0$

can be solved for $\hat\lambda_a(\psi)$, $\hat\sigma_a(\psi)$. Our method of computation is to expand $H(\hat\lambda, \hat\sigma, a) =
H(\lambda + l, \sigma + s, \psi + a^*)$ in terms of $l$, $s$, and $a^*$, consider only linear terms, and show
(under the conditions of the theorem) that the resulting linear equations can be solved for
$l$ and $s$ in terms of $a^*$.

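Let $\hat\Lambda = \Lambda + L$, $\hat\Sigma = \Sigma + S$, $A = \Sigma + \Lambda\Lambda' + A^*$, $\hat\Gamma = \Gamma + G$. Then (7.5) can be
written in linear terms as

(12.7)  $\mathrm{diag}\,S = \mathrm{diag}\,(A^* - \Lambda L' - L\Lambda')$.

Since $\hat\Sigma^{-1} = (\Sigma + S)^{-1}$ is $\Sigma^{-1} - \Sigma^{-1}S\Sigma^{-1}$ to linear terms in $S$, (7.6) can be written (in
linear terms) as

(12.8)  $\Gamma + G = \Gamma + L'\Sigma^{-1}\Lambda + \Lambda'\Sigma^{-1}L - \Lambda'\Sigma^{-1}S\Sigma^{-1}\Lambda$,

and (7.8) can be written (in linear terms) as

(12.9)  $\Lambda\Gamma + \Lambda L'\Sigma^{-1}\Lambda + \Lambda\Lambda'\Sigma^{-1}L - \Lambda\Lambda'\Sigma^{-1}S\Sigma^{-1}\Lambda + L\Gamma$
        $= \Lambda\Lambda'\Sigma^{-1}\Lambda + \Lambda\Lambda'\Sigma^{-1}L - \Lambda\Lambda'\Sigma^{-1}S\Sigma^{-1}\Lambda + A^*\Sigma^{-1}\Lambda - S\Sigma^{-1}\Lambda$.

Then (12.9) and (7.7) can be written (in linear terms) as

(12.10)  $L\Gamma + \Lambda L'\Sigma^{-1}\Lambda + S\Sigma^{-1}\Lambda = A^*\Sigma^{-1}\Lambda$,

(12.11)  $\mathrm{nondiag}\,(L'\Sigma^{-1}\Lambda + \Lambda'\Sigma^{-1}L - \Lambda'\Sigma^{-1}S\Sigma^{-1}\Lambda) = \mathrm{nondiag}\,0$.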

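We now show that (12.7), (12.10) and (12.11) can be solved for $S$ and $L$ under the
conditions of the theorem. From (12.10) we derive

(12.12)  $L = A^*\Sigma^{-1}\Lambda\Gamma^{-1} - S\Sigma^{-1}\Lambda\Gamma^{-1} - \Lambda L'\Sigma^{-1}\Lambda\Gamma^{-1}$

and

(12.13)  $L\Lambda' = A^*\Sigma^{-1}H - S\Sigma^{-1}H - \Lambda L'\Sigma^{-1}H$,

where $H = \Lambda\Gamma^{-1}\Lambda'$. Using (12.13) we have

(12.14)  $A^* - L\Lambda' - \Lambda L' = A^* - A^*\Sigma^{-1}H + S\Sigma^{-1}H + \Lambda L'\Sigma^{-1}H$
         $\quad - H\Sigma^{-1}A^* + H\Sigma^{-1}S + H\Sigma^{-1}L\Lambda'$
         $= A^* - A^*\Sigma^{-1}H + S\Sigma^{-1}H + \Lambda L'\Sigma^{-1}H$
         $\quad - H\Sigma^{-1}A^* + H\Sigma^{-1}S + H\Sigma^{-1}(A^*\Sigma^{-1}H - S\Sigma^{-1}H - \Lambda L'\Sigma^{-1}H)$.

Since $H\Sigma^{-1}\Lambda = \Lambda$, (12.14) can be combined with (12.7) to give

(12.15)  $\mathrm{diag}\,[(\Sigma - H)\Sigma^{-1}S\Sigma^{-1}(\Sigma - H)] = \mathrm{diag}\,[(\Sigma - H)\Sigma^{-1}A^*\Sigma^{-1}(\Sigma - H)]$

or

(12.16)  $\mathrm{diag}\,\Phi\Sigma^{-1}S\Sigma^{-1}\Phi = \mathrm{diag}\,\Phi\Sigma^{-1}A^*\Sigma^{-1}\Phi$.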


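The $i$th component equation is

(12.17)  $\sum_{j=1}^{p} \phi_{ij}^2\,\frac{s_{jj}}{\sigma_{jj}^2} = \sum_{g,h=1}^{p} \phi_{ig}\,\frac{a^*_{gh}}{\sigma_{gg}\sigma_{hh}}\,\phi_{ih}$.

This can be solved if the matrix $\Xi$ with elements $\xi_{ij} = \phi_{ij}^2$ is nonsingular.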

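Equation (12.11) can be written as

(12.18)  $\mathrm{nondiag}\,(Q + Q') = \mathrm{nondiag}\,V$,

where $Q = L'\Sigma^{-1}\Lambda$ and $V = \Lambda'\Sigma^{-1}S\Sigma^{-1}\Lambda$. Multiplication of (12.10) on the left by
$\Lambda'\Sigma^{-1}$ gives

(12.19)  $Q'\Gamma + \Gamma Q = U - V$,

where $U = \Lambda'\Sigma^{-1}A^*\Sigma^{-1}\Lambda$. The $\alpha, \alpha$th equation of (12.19) yields

(12.20)  $q_{\alpha\alpha} = \frac{u_{\alpha\alpha} - v_{\alpha\alpha}}{2\gamma_{\alpha\alpha}}$

and the $\alpha, \beta$th equations of (12.18) and (12.19) yield

(12.21)  $q_{\alpha\beta} = \frac{u_{\alpha\beta} - v_{\alpha\beta} - v_{\alpha\beta}\gamma_{\beta\beta}}{\gamma_{\alpha\alpha} - \gamma_{\beta\beta}}$,  $\alpha \ne \beta$.

Substitution of $Q = L'\Sigma^{-1}\Lambda$ in (12.12) gives $L$ in terms of $A^*$ and $S$. This proves the
theorem.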
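The solution for $Q$ in (12.20) and (12.21) can be verified numerically against (12.18) and (12.19). The sketch below assumes NumPy; the symmetric matrices standing in for $U$ and $V$ and the distinct diagonal elements of $\Gamma$ are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 3
gam = np.array([4.0, 2.5, 1.0])            # distinct, ordered diagonal elements of Gamma
U = rng.normal(size=(m, m))
U = U + U.T                                # symmetric stand-in for Lambda' Sigma^{-1} A* Sigma^{-1} Lambda
V = rng.normal(size=(m, m))
V = V + V.T                                # symmetric stand-in for Lambda' Sigma^{-1} S Sigma^{-1} Lambda

Q = np.empty((m, m))
for a in range(m):
    for b in range(m):
        if a == b:
            # Diagonal elements from (12.20).
            Q[a, a] = (U[a, a] - V[a, a]) / (2.0 * gam[a])
        else:
            # Off-diagonal elements from (12.21).
            Q[a, b] = (U[a, b] - V[a, b] - V[a, b] * gam[b]) / (gam[a] - gam[b])

G = np.diag(gam)
resid_12_19 = Q.T @ G + G @ Q - (U - V)    # should be the zero matrix
offdiag = ~np.eye(m, dtype=bool)
resid_12_18 = (Q + Q.T - V)[offdiag]       # off-diagonal part should vanish
```

The division by $\gamma_{\alpha\alpha} - \gamma_{\beta\beta}$ shows where the condition that the diagonal elements of $\Gamma$ be different is used.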

From the above formulas we can find the asymptotic variances and covariances of the
estimates. When the observations are drawn from a normal distribution with covariance
matrix $\Psi$,

(12.22)  $N\mathcal{E}(a_{gh} - \psi_{gh})(a_{kl} - \psi_{kl}) = N\mathcal{E}a^*_{gh}a^*_{kl} = \psi_{gk}\psi_{hl} + \psi_{gl}\psi_{hk}$,

and this is the limiting covariance of $\sqrt{N}a^*_{gh}$ and $\sqrt{N}a^*_{kl}$. The solution of (12.17) can be
expressed as

(12.23)  $\frac{s_{kk}}{\sigma_{kk}^2} = \sum_{i,g,h} \xi^{ki}\,\phi_{ig}\,\frac{a^*_{gh}}{\sigma_{gg}\sigma_{hh}}\,\phi_{ih}$,

where $(\xi^{ki}) = \Xi^{-1}$. Then it is a straightforward computation to find that the limiting
covariance of $\sqrt{N}s_{kk}/\sigma_{kk}^2$ and $\sqrt{N}s_{gg}/\sigma_{gg}^2$ is $2\xi^{kg}$. This shows that

(12.24)  $\lim_{N\to\infty} N\mathcal{E}(\hat\sigma_{kk} - \sigma_{kk})(\hat\sigma_{gg} - \sigma_{gg}) = 2\xi^{kg}\sigma_{kk}^2\sigma_{gg}^2$.
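Formula (12.24) is simple to evaluate once $\Lambda$ and $\Sigma$ are given. The following sketch assumes NumPy; the function name `sigma_hat_limit_cov` and the parameter values are hypothetical. It forms $\Phi$ from (12.1), the matrix $\Xi$ of squared elements, and the limiting covariance matrix $2\xi^{kg}\sigma_{kk}^2\sigma_{gg}^2$, and it also returns $\Phi$ so that the identities $\Phi\Sigma^{-1}\Lambda = 0$ and $\Phi\Sigma^{-1}\Phi = \Phi$, which follow from (12.1), can be inspected.

```python
import numpy as np

def sigma_hat_limit_cov(Lam, sigma2):
    """Limiting covariance matrix of sqrt(N) (sigma_hat_kk - sigma_kk), following (12.24).

    Lam: (p, m) loadings; sigma2: (p,) diagonal elements sigma_kk of Sigma.
    Returns the (p, p) covariance matrix and the matrix Phi of (12.1).
    """
    Sigma = np.diag(sigma2)
    W = Lam / sigma2[:, None]                               # Sigma^{-1} Lambda
    Phi = Sigma - Lam @ np.linalg.solve(Lam.T @ W, Lam.T)   # (12.1)
    Xi = Phi ** 2                                           # elements phi_ij squared
    Xi_inv = np.linalg.inv(Xi)                              # needs |phi_ij^2| != 0
    scale = np.outer(sigma2 ** 2, sigma2 ** 2)              # sigma_kk^2 sigma_gg^2
    return 2.0 * Xi_inv * scale, Phi

rng = np.random.default_rng(5)
p, m = 6, 2
Lam = rng.normal(size=(p, m))
sigma2 = rng.uniform(0.5, 1.5, size=p)
cov, Phi = sigma_hat_limit_cov(Lam, sigma2)
```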

The limiting covariances of elements of $\hat\Lambda$ are complicated and depend on the
identification conditions. We give one formula to show how involved such expressions are and to
show that they are similar to formulas obtained in other problems involving characteristic
vectors ([16], for example). Let $\lambda_\nu$ be the $\nu$th column of $\Lambda$ and $\hat\lambda_\nu$ the $\nu$th column of $\hat\Lambda$.
Then

(12.25)  yvv  lim  N£(\>-  X„)(X„-  X,)'  =  Af,2~1  (*+4 X.X)  2~,M„ 

JV— »oo 

+  Pv  (\kyZk°\gy)Pp  , 

where  (\ky^ka^0y)  is  a  matrix  with  indicated  elements  and 

(12.26)  =  2  — X„x:-  V - 1 - XaX', 

(12.27)  P*  =  2  -  - - (I+7-)  S - - - X<X. 

2y»»  i£y«a-yrr 


The method of estimation of section 7.4 (principal components of $A - \hat\Sigma$) can be
studied in a similar fashion and one can derive the following result:

Theorem 12.2. If $|\theta_{ij}^2| \ne 0$ where $\Theta = I - \Lambda(\Lambda'\Lambda)^{-1}\Lambda'$, if $\Lambda$ is identified by the
condition that $\Lambda'\Lambda$ is diagonal and the diagonal elements are different and ordered, if $A$
converges stochastically to $\Psi$, and if $\sqrt{N}(A - \Psi)$ has a limiting normal distribution, then
$\sqrt{N}(\hat\Lambda - \Lambda)$, $\sqrt{N}(\hat\Sigma - \Sigma)$, defined by (7.22) to (7.25), have a limiting normal distribution.

We can also prove that when $\Lambda$ is identified by requiring certain elements to be 0 then
the estimates are also asymptotically normally distributed. Instead of requiring $m_{\alpha\alpha} = 1$,
we can normalize each column of $\Lambda$ by a restriction on that column (for example,
requiring $\lambda_{\nu\nu} = 1$, $\nu = 1, \ldots, m$, if none of these is specified to be 0). Then all of the
restrictions are on $\Lambda$. This is desirable because then we can compare populations that
differ only in $M$ (see section 7.9) and we can compare the case of random and nonrandom
factor scores. We can then prove a striking and powerful theorem. Let

(12.28)  $M(N) = \frac{1}{N}\sum_{\alpha=1}^{N} [f_\alpha - \bar f(N)][f_\alpha - \bar f(N)]'$,

(12.29)  $\sigma_{ii}(N) = \frac{1}{N}\sum_{\alpha=1}^{N} [u_{i\alpha} - \bar u_i(N)]^2$,

where

(12.30)  $\bar f(N) = \frac{1}{N}\sum_{\alpha=1}^{N} f_\alpha$,  $\bar u_i(N) = \frac{1}{N}\sum_{\alpha=1}^{N} u_{i\alpha}$,

and

(12.31)  $b_{ij}(N) = \frac{1}{\sqrt{N}}\sum_{\alpha=1}^{N} [u_{i\alpha} - \bar u_i(N)][u_{j\alpha} - \bar u_j(N)]$,  $i \ne j$,

and let $\Sigma(N)$ be the diagonal matrix with (12.29) as elements.

Theorem 12.3. If $\Lambda$ is identified by specified zero elements, if the identification and
normalization is by restrictions on $\Lambda$, if $M(N)$ and $\Sigma(N)$ approach limits in probability
and if $b_{ij}(N)$ have a limiting joint normal distribution with zero means, then $\sqrt{N}(\hat\Lambda - \Lambda)$,
$\sqrt{N}[\hat M - M(N)]$ and $\sqrt{N}[\hat\Sigma - \Sigma(N)]$ are asymptotically normally distributed, where
$\hat\Lambda$, $\hat M$, $\hat\Sigma$ are the maximum likelihood estimates of section 7.6 normalized by the restrictions
on $\Lambda$. If

(12.32)  $\lim_{N\to\infty} \mathcal{E}\,b_{ij}(N)\,b_{kl}(N) = \mathrm{plim}_{N\to\infty}\,\sigma_{ii}(N)\,\sigma_{jj}(N)$,  $i = k$, $j = l$,
         $\qquad\qquad = \mathrm{plim}_{N\to\infty}\,\sigma_{ii}(N)\,\sigma_{jj}(N)$,  $i = l$, $j = k$,
         $\qquad\qquad = 0$,  otherwise,

then the parameters of the limiting normal distribution of the estimates depend only on $\Lambda$,
plim $M(N)$ and plim $\Sigma(N)$.

The proof of this theorem is too involved to give here. It should be noted that if $f$ and
$U$ are normally distributed, the theorem holds.